View / download Resume (last updated on 02-May-2025)
About
As an Azure Data Engineer at Capgemini, I develop end-to-end data engineering solutions such as EDH, ODS, Data Marts and Data Lakes. I am experienced in creating big data ETL pipelines with a medallion architecture in Azure using Python, SQL and PySpark. I migrated on-prem processes to the cloud, implemented a CDC process with SCD Type 2, improved pipeline reusability by developing a metadata-driven architecture, performed data cleaning and transformations, and created PII-masked views and extracts as per business requirements. I also automated manual and repetitive tasks such as creating SQL queries and directory structures, optimized pipelines and Python programs to reduce average runtime, and worked on dynamic real-time status and metadata tracking of ETL job extracts using Python.
I hold the active “Microsoft Certified: Azure Data Engineer Associate” (DP-203) and “AWS Certified Cloud Practitioner” (CLF-C02) certifications, and have a strong background in Data Science and Machine Learning. I also completed my M.Tech. (CSE) at BIT, Mesra, with a thesis focused on this domain.
I am passionate about finding solutions for individual and organizational growth, and focus on continuously improving and utilizing my skills. I collaborate with my team and clients to deliver high-quality results and value. I am always eager to learn new technologies and tools, and to apply them to solve real-world problems.
Skills
- Programming Languages: SQL, Python (PySpark)
- Azure Services: Synapse Analytics, Databricks, Data Factory, SQL Database, Data Lake Storage, Logic Apps
- Big Data Engineering: Apache Spark, ETL Development
- Databases: Azure SQL Database, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite
Certifications
- Passed the “Microsoft Certified: Azure Data Engineer Associate” (DP-203) certification exam from Microsoft (View Certificate)
- Passed the “AWS Certified Cloud Practitioner” (CLF-C02) certification exam from Amazon Web Services (View Certificate)
Professional Experience
- Current Role: Azure Data Engineer
- Current Designation: Associate Consultant
- Organization: Capgemini
- Duration: March 2022 - Present
Data Engineering
End-to-end development of Enterprise Data Hub, Operational Data Store, Data Marts and Data Lakes with Views and Extracts generation
- Migrated on-prem big data ETL processes to the cloud by creating storage-event-triggered and schedule-triggered pipelines with a medallion architecture in Azure using Python, SQL and PySpark
- Implemented a change data capture (CDC) process to store transformed data using SCD Type 2
- Improved reusability by developing a metadata-driven architecture to create dynamic pipelines, which selectively fetch data by joining required source tables and applying transformations, to generate PII-masked views and extracts as per business requirements
- Optimized pipelines by applying conditional activity executions to reduce average runtime by 38%
- Identified and automated manual and repetitive tasks by developing dynamic Python scripts to generate SQL queries and create directory structures
- Implemented pre-load, data quality and data control checks
- Performed data cleaning and applied transformations on Parquet and CSV big data feeds
- Implemented status email notification functionality in pipelines using Azure Logic Apps and Web Activities
- Improved fault tolerance by identifying and covering multiple edge cases
Software Engineering
Status and Metadata Reports Generation
- Dynamic real-time status and metadata tracking of ETL job extracts using Python
- Extracted metadata properties and row counts dynamically from DAT and TXT files
Miscellaneous
- Optimized Python programs to reduce average runtime by 23%
- Automated Excel macro runs by creating Python scripts that email daily consolidated status reports
Education
Birla Institute of Technology, Mesra
- Degree: Master of Technology
- Branch: Computer Science and Engineering
- CGPA: 8.06
- Duration: July 2018 to July 2020
Thesis Work
Title: Diabetes Prediction using Machine Learning (View on GitHub)
Languages: Python (NumPy, Pandas, Matplotlib, Seaborn, scikit-learn / sklearn), Markdown
Software: Jupyter Notebook (Anaconda)
- Achieved up to 81.6% accuracy in diabetes prediction on the Pima Indians Diabetes Database with a Random Forest classifier
- Applied the K-Nearest Neighbors, Support Vector Machine, Decision Tree and Random Forest classification algorithms to diabetes prediction and analyzed their accuracies
- Achieved up to 7.04% improvement in the accuracy of the Decision Tree classification algorithm for diabetes prediction
- Predicted missing values in the dataset using the Linear Regression, Support Vector Regression, Decision Tree and Random Forest regression algorithms
- Balanced the dataset using the SMOTE algorithm, then applied feature scaling
Project Work
Title: CoWIN Vaccine Notifier (View on GitHub)
Languages: Python, Markdown
Software: Jupyter Notebook (Anaconda)
- Developed a real-time Covid vaccine availability tracker using Python to notify the user as soon as any desired vaccine is available on the CoWIN website for booking
- Implemented 4 dynamic filters on the vaccination calendar received as a JSON response from the CoWIN API
- Helped more than 30 people to get Covid vaccines on time
Achievements
- Earned Gold badge (5 Stars) for SQL on HackerRank (Visit Profile)
- Achieved 2nd runner-up position among 215 teams in i3i 2023, a Capgemini hackathon for the insurance domain (View Certificate)
- Secured rank 6,888 among 1,08,495 candidates in the GATE (CS) exam organized by the Indian Institute of Science (IISc), Bangalore
Profiles
- LinkedIn: ShubhanshuTrip
- GitHub: ShubhanshuTripathi
- HackerRank: ShubhanshuTrip
Contact
- Email: [email protected]
- LinkedIn: ShubhanshuTrip