View / download Resume (last updated on 28-Jan-2025)
About
As an Azure Data Engineer at Capgemini, I create end-to-end big data ETL pipelines with medallion architecture in Azure using Python, SQL and PySpark. I have migrated on-prem processes to the cloud, implemented data quality and control checks, performed data cleaning and transformations, and created PII-masked views and extracts as per business requirements. I have also automated manual and repetitive tasks, optimized pipelines and Python programs to reduce average runtime, and worked on dynamic real-time status and metadata tracking using Python.
I hold the active “Microsoft Certified: Azure Data Engineer Associate” (DP-203) and “AWS Certified Cloud Practitioner” (CLF-C02 / CLF-C01) certifications, and have a strong background in Data Science and Machine Learning. I also completed my M.Tech. in Computer Science and Engineering from BIT, Mesra, with a thesis focused on this domain.
I am passionate about finding solutions for individual and organizational growth, and focus on continuously improving and utilizing my skills. I collaborate with my team and clients to deliver high-quality results and value. I am always eager to learn new technologies and tools, and to apply them to solve real-world problems.
Skills
- Python
- SQL (Microsoft SQL Server, MySQL)
- PySpark (Apache Spark), Big Data
- ETL (Extract, Transform, Load) development
- Azure (Synapse Analytics, Databricks, Data Factory, Data Lake Storage, SQL Database, Logic Apps), Amazon Web Services (AWS)
- Git / GitHub
- Data Science, Machine Learning
Certifications
- Passed the “Microsoft Certified: Azure Data Engineer Associate” (DP-203) certification exam from Microsoft (View Certificate)
- Passed the “AWS Certified Cloud Practitioner” (CLF-C02 / CLF-C01) certification exam from Amazon Web Services (View Certificate)
- Completed the “Basics of Data Science and Machine Learning” course from Coding Ninjas (View Certificate)
Professional Experience
- Current Role: Azure Data Engineer
- Current Designation: Associate Consultant
- Organization: Capgemini
- Duration: March 2022 - Present
Data Engineering
End-to-end development of Operational Data Store, Data Hub, Data Marts and Data Lakes with Views and Extracts generation
- Migrated on-prem big data ETL processes to the cloud by creating storage-event-triggered and schedule-triggered pipelines with medallion architecture in Azure using Python, SQL and PySpark
- Implemented pre-load, data quality, and data control checks
- Performed data cleaning and applied transformations on Parquet and CSV big data feeds
- Implemented a change data capture (CDC) process to store transformed data using SCD Type 2
- Developed Data Marts by creating dynamic pipelines that selectively fetch data by joining multiple source tables and apply transformations, to generate PII-masked views and extracts as per business requirements
- Implemented dynamic pipeline status email notification functionality using Azure Logic Apps and Web Activity
- Optimized pipelines by applying conditional activity executions to reduce average runtime by 38%
- Identified and automated manual and repetitive tasks (such as SQL query creation) to save the team's time and effort
- Identified and covered multiple edge cases to create a more fault-tolerant system
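For illustration, the SCD Type 2 idea mentioned above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the production pipeline (which runs in Azure with PySpark and SQL): the function name `apply_scd2` and the columns `valid_from`, `valid_to` and `is_current` are hypothetical names chosen for this example.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, load_date):
    """Return a new dimension list after applying incoming rows as SCD Type 2.

    Hypothetical helper: rows are plain dicts, and the history columns
    valid_from / valid_to / is_current are assumed names.
    """
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}

    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # no change in tracked columns: keep the current version
        if old is not None:
            # Close the old version instead of overwriting it
            old["valid_to"] = load_date
            old["is_current"] = False
        version = {key: new[key]}
        version.update({c: new[c] for c in tracked})
        version.update({"valid_from": load_date, "valid_to": None, "is_current": True})
        result.append(version)
        current[new[key]] = version
    return result


dim = [
    {"customer_id": 1, "city": "Pune", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [
    {"customer_id": 1, "city": "Mumbai"},  # changed record
    {"customer_id": 2, "city": "Delhi"},   # new record
]
history = apply_scd2(dim, updates, key="customer_id", tracked=["city"],
                     load_date=date(2025, 1, 28))
```

Here the changed record keeps its old row with an end date and gains a new current row, so earlier values remain queryable.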
Software Engineering
Status and Metadata Reports Generation
- Dynamic real-time status and metadata tracking using Python
- Extracted metadata properties and row counts dynamically from DAT and TXT files
Miscellaneous
- Optimized Python programs to reduce average runtime by 23%
- Automated Excel macro runs by creating Python scripts, to email daily consolidated status reports
Education
Birla Institute of Technology, Mesra
- Degree: Master of Technology
- Branch: Computer Science and Engineering
- CGPA: 8.06
- Duration: July 2018 to July 2020
Thesis Work
Title: Diabetes Prediction using Machine Learning (View on GitHub)
Languages: Python 3, Markdown
Software: Jupyter Notebook (Anaconda)
- Achieved up to 81.6% accuracy in Diabetes Prediction on Pima Indians Diabetes Database with Random Forest classifier
- Applied K-Nearest Neighbors, Support Vector Machine, Decision Tree and Random Forest classification algorithms for diabetes prediction and compared their accuracies
- Achieved up to 7.04% improvement in the accuracy of Decision Tree classification algorithm for Diabetes Prediction
- Predicted missing values in the dataset using a set of regression algorithms: Linear Regression, Support Vector Regression, Decision Tree and Random Forest
- Balanced the dataset using the SMOTE algorithm and then applied feature scaling
Project Work
Title: CoWIN Vaccine Notifier (View on GitHub)
Languages: Python 3, Markdown
Software: Jupyter Notebook (Anaconda)
- Developed a Python notebook to notify the user as soon as any desired COVID-19 vaccine is available for booking on the CoWIN website
- Implemented 4 dynamic filters on the vaccination calendar received as a JSON response from the CoWIN API
- Helped more than 30 people get COVID-19 vaccines using this notifier
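As an illustration of filtering a vaccination calendar, here is a minimal sketch that works on an inline sample payload rather than a live API call. The four filters shown (vaccine, age limit, fee type, minimum capacity), the function name `find_sessions`, and the JSON field names are assumptions made for this example and may differ from the notebook on GitHub.

```python
def find_sessions(calendar, vaccine=None, min_age=None, fee_type=None, min_capacity=1):
    """Return (center name, date, capacity) for sessions passing every active filter.

    Hypothetical helper; a filter set to None is skipped.
    """
    matches = []
    for center in calendar.get("centers", []):
        if fee_type is not None and center.get("fee_type") != fee_type:
            continue
        for session in center.get("sessions", []):
            if vaccine is not None and session.get("vaccine") != vaccine:
                continue
            if min_age is not None and session.get("min_age_limit") != min_age:
                continue
            if session.get("available_capacity", 0) < min_capacity:
                continue
            matches.append(
                (center.get("name"), session.get("date"), session.get("available_capacity"))
            )
    return matches


# Made-up sample data shaped like the assumed response
sample = {
    "centers": [
        {"name": "Center A", "fee_type": "Free", "sessions": [
            {"date": "01-06-2021", "vaccine": "COVISHIELD",
             "min_age_limit": 18, "available_capacity": 5},
            {"date": "02-06-2021", "vaccine": "COVAXIN",
             "min_age_limit": 45, "available_capacity": 0},
        ]},
        {"name": "Center B", "fee_type": "Paid", "sessions": [
            {"date": "01-06-2021", "vaccine": "COVISHIELD",
             "min_age_limit": 18, "available_capacity": 10},
        ]},
    ]
}

available = find_sessions(sample, vaccine="COVISHIELD", min_age=18, fee_type="Free")
```

A notifier could call such a function on each polled response and alert the user whenever the returned list is non-empty.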
Achievements
- Achieved the 2nd runner-up position among 215 teams in i3i 2023, a Capgemini hackathon for the insurance domain (View Certificate)
- Secured rank 6,888 among 1,08,495 candidates in GATE (CS) exam organised by ‘Indian Institute of Science (IISc), Bangalore’
Profiles
- LinkedIn: ShubhanshuTrip
- GitHub: ShubhanshuTripathi
Contact
- Email: [email protected]
- LinkedIn: ShubhanshuTrip