Selim Incekara
Portfolio
Datathon 2024 – Entrepreneurship Foundation Score Prediction
• Developed a machine learning model to predict evaluation scores for candidates applying to the Entrepreneurship Foundation using historical data (2014-2023).
• Processed a dataset of over 10,000 applications, performing data cleaning, outlier detection, and feature engineering to enhance model performance.
• Implemented and compared multiple algorithms (Linear Regression, Random Forest, XGBoost, LightGBM), with LightGBM achieving the highest score (a minimal training sketch follows below).
• Obtained a Kappa evaluation metric of 0.76 and an accuracy of 84.2%, outperforming baseline models.
• Conducted SHAP analysis to interpret feature importance, ensuring model transparency and fairness.
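A minimal sketch of the LightGBM training and SHAP workflow outlined above, assuming a preprocessed tabular dataset; the file name and column names (applications_2014_2023.csv, evaluation_score) are hypothetical placeholders, not the competition's actual schema.

```python
# Sketch only: file and column names are hypothetical placeholders.
import pandas as pd
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split

df = pd.read_csv("applications_2014_2023.csv")   # hypothetical path
X = df.drop(columns=["evaluation_score"])        # features assumed already numeric
y = df["evaluation_score"]                       # hypothetical target column

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50)])    # stop when validation score stalls

# SHAP exposes per-feature contributions for transparency and fairness checks
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X_val), X_val)
```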
SMS Spam Detection – NLP Classification
• Developed a spam classification model to filter malicious SMS messages using the SMS Spam Collection Dataset (5,574 messages).
• Preprocessed text with tokenization, lemmatization, and stop-word removal, improving text clarity and standardization.
• Converted SMS messages into numerical features using TF-IDF and Bag of Words (BoW) techniques, optimizing feature representation.
• Built models using Logistic Regression, Naive Bayes, and Support Vector Machines (SVM), achieving the highest accuracy with Naive Bayes (sketched below).
• Evaluated model performance using accuracy (92.1%), precision, recall, and F1-score, ensuring reliable spam detection.
• Generated word-cloud visualizations and analyzed misclassified messages, improving feature engineering strategies.
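A minimal sketch of the TF-IDF + Naive Bayes variant described above. The tab-separated layout of the SMS Spam Collection file is an assumption, and the lemmatization step is omitted for brevity.

```python
# Sketch only: dataset path and column layout are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

# TfidfVectorizer handles tokenization and stop-word removal in one step
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(X_train, y_train)

# accuracy, precision, recall, and F1 per class
print(classification_report(y_test, clf.predict(X_test)))
```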
Digit Recognizer
• Developed a Convolutional Neural Network (CNN) model on the MNIST dataset (60,000 training images, 10,000 test images) to recognize handwritten digits (a minimal Keras sketch follows below).
• Achieved 93.7% accuracy by optimizing the model through data augmentation, dropout regularization, and hyperparameter tuning.
• Improved model robustness by implementing batch normalization and adaptive learning-rate schedules.
• Visualized predictions and misclassified images using Matplotlib and Seaborn, gaining deeper insight into model performance.
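A minimal Keras sketch of the CNN described above, assuming TensorFlow/Keras as the framework; the layer sizes are illustrative choices and the data-augmentation step is omitted for brevity.

```python
# Sketch only: layer sizes and epoch count are illustrative choices.
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0  # add channel dim, scale

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.BatchNormalization(),      # batch normalization for robustness
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),              # dropout regularization
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Adaptive learning rate: halve the LR when validation loss plateaus
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2)
model.fit(x_train, y_train, epochs=5, validation_split=0.1, callbacks=[lr_schedule])
print(model.evaluate(x_test, y_test))
```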
IMDB 50K Movie Reviews – Sentiment Analysis
• Developed a Natural Language Processing (NLP) model to classify movie reviews as positive or negative using the IMDB 50K dataset (25,000 training, 25,000 test samples).
• Preprocessed text data with tokenization, lemmatization, and stop-word removal, reducing noise and improving model interpretability.
• Converted text data into numerical representations using TF-IDF and Word2Vec embeddings, enhancing feature extraction.
• Trained and compared Logistic Regression, Naive Bayes, and Support Vector Machine (SVM) models, identifying SVM as the best-performing classifier (sketched below).
• Evaluated model performance using accuracy (86.4%), F1-score, precision, and recall, ensuring a balanced classification approach.
• Visualized sentiment distribution and misclassified samples with Matplotlib and Seaborn, enhancing model interpretability.
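A minimal sketch of the TF-IDF + SVM variant identified above as best-performing; the CSV path and the review/sentiment column names are assumptions.

```python
# Sketch only: file and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("imdb_50k.csv")   # hypothetical path with 'review'/'sentiment' columns

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"], test_size=0.5,
    stratify=df["sentiment"], random_state=42)   # 25K/25K split, as in the dataset

clf = make_pipeline(TfidfVectorizer(stop_words="english", max_features=50_000),
                    LinearSVC())                 # linear SVM scales well to sparse TF-IDF
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(accuracy_score(y_test, pred),
      f1_score(y_test, pred, pos_label="positive"))  # assumes string class labels
```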
Adult Census Income – Income Prediction Model
• Data Preprocessing & Feature Engineering
  o Analyzed the US Census Bureau Adult dataset (32,561 samples, 15 features) to predict whether individuals earn over $50K per year.
  o Identified significant features such as age, education level, occupation, and marital status through correlation analysis and chi-square tests for categorical variables.
  o Handled missing values using mean imputation for numerical features and mode imputation for categorical features, avoiding data loss.
  o Encoded categorical variables using One-Hot Encoding and Label Encoding to make them compatible with machine learning algorithms.
• Model Development & Optimization
  o Implemented and compared Logistic Regression, Decision Tree, and Random Forest algorithms, with Random Forest the best performer at over 85% accuracy (a minimal pipeline sketch follows below).
  o Tuned hyperparameters using GridSearchCV and RandomizedSearchCV, improving precision, recall, and F1-score.
  o Applied data normalization (Min-Max Scaling and Standardization) to ensure model stability and avoid bias from differing feature scales.
• Performance Evaluation & Results
  o The final tuned model achieved 89% accuracy, predicting income classes with high precision and recall.
  o Evaluated model performance using a confusion matrix, precision-recall curve, and ROC-AUC, ensuring balanced performance across both income classes.
  o Delivered comprehensive results in a detailed report, providing insights into the socioeconomic factors influencing income levels.
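A minimal sketch of the Random Forest pipeline with GridSearchCV described above; the CSV path and the income target column are assumptions following common Adult-dataset conventions.

```python
# Sketch only: path and column names are assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("adult.csv")                      # hypothetical path
X, y = df.drop(columns=["income"]), df["income"]   # assumed target column

categorical = X.select_dtypes(include="object").columns.tolist()
pre = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), categorical)],
                        remainder="passthrough")   # numeric features pass through

pipe = Pipeline([("pre", pre), ("rf", RandomForestClassifier(random_state=42))])

# Small illustrative grid; the real search would cover more parameters
grid = GridSearchCV(pipe, {"rf__n_estimators": [200, 500],
                           "rf__max_depth": [10, 20, None]}, cv=5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```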
Heart Failure Prediction – Advanced Machine Learning Model
• Exploratory Data Analysis (EDA) & Feature Engineering
  o Analyzed the Heart Failure Clinical Records dataset (299 samples, 13 clinical features) to understand correlations between features and the target variable.
  o Identified key features affecting patient survival, such as ejection fraction, serum creatinine, and age, using Pearson correlation and feature-importance scores.
  o Visualized the data using box plots, pair plots, and correlation heatmaps to detect patterns, class imbalance, and multicollinearity.
  o Handled missing values using imputation techniques and engineered new features such as BMI categories and risk levels to enhance model performance.
• Outlier Detection & Data Preprocessing
  o Detected and treated outliers in numerical features using the IQR (Interquartile Range) method and Z-score analysis, reducing model bias.
  o Encoded categorical variables using One-Hot Encoding (OHE) and Label Encoding for algorithm compatibility.
  o Scaled numerical features using Min-Max Scaling and Standardization, ensuring consistent performance across multiple models.
• Model Development & Hyperparameter Tuning
  o Trained and compared Logistic Regression, Random Forest, and XGBoost to identify the most accurate classifier.
  o XGBoost outperformed the other models with 92% accuracy after hyperparameter tuning with GridSearchCV and RandomizedSearchCV.
  o Evaluated model robustness using cross-validation, fine-tuning XGBoost parameters such as learning rate, number of estimators, and max depth.
• Model Fairness & Explainability
  o Assessed fairness across demographic groups (age, gender, smoking status, etc.) to ensure unbiased predictions.
  o Applied SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to interpret feature contributions to model decisions.
  o Calibrated model probabilities using Platt Scaling and Isotonic Regression, improving predicted risk estimates for patients (a calibration sketch follows below).
• Performance Evaluation & Results
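A minimal sketch of the XGBoost training and isotonic probability calibration referenced above; the dataset file name and the DEATH_EVENT target column follow the public Kaggle version of the dataset and are assumptions here.

```python
# Sketch only: file name and target column are assumptions.
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("heart_failure_clinical_records_dataset.csv")  # assumed file name
X, y = df.drop(columns=["DEATH_EVENT"]), df["DEATH_EVENT"]      # assumed target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4,
                    eval_metric="logloss", random_state=42)

# Isotonic regression maps raw scores to calibrated risk probabilities
clf = CalibratedClassifierCV(xgb, method="isotonic", cv=5)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5, 1])  # calibrated per-patient risk estimates
```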
Housing Prices Prediction – Regression Model
• Exploratory Data Analysis & Feature Selection
  o Developed a predictive model to estimate housing prices using Kaggle's Housing Prices dataset (1,460 samples, 81 features).
  o Performed data cleaning by handling missing values, detecting outliers (using IQR and Z-score methods), and encoding categorical features with Label Encoding and One-Hot Encoding.
  o Selected important features such as overall quality, square footage, neighborhood, and number of rooms using feature-importance scores from decision-tree-based algorithms.
• Model Building & Hyperparameter Tuning
  o Trained and compared multiple regression models, including Linear Regression, Lasso, Ridge, XGBoost, and LightGBM, to predict house sale prices (a minimal XGBoost sketch follows below).
  o XGBoost performed best, reaching an R² of 0.90 after tuning hyperparameters such as learning rate, max depth, and number of estimators with GridSearchCV and RandomizedSearchCV.
  o Addressed multicollinearity and overfitting by applying Ridge and Lasso regularization, improving model generalization.
• Performance Evaluation & Results
  o Evaluated model performance using regression metrics: R-squared (0.90), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE).
  o Generated visual reports comparing predicted vs. actual house prices using scatter plots and error-distribution histograms.
  o The final model predicted house prices with high precision, providing valuable insights for real-estate market analysis.
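A minimal sketch of the XGBoost regression workflow described above; the train.csv file and SalePrice target follow Kaggle's House Prices competition layout, and the single get_dummies step stands in for the fuller preprocessing listed above.

```python
# Sketch only: assumes Kaggle's House Prices train.csv layout.
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")
X = pd.get_dummies(df.drop(columns=["SalePrice"]))  # one-hot encode categoricals
y = df["SalePrice"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost handles remaining NaNs natively, so no explicit imputation here
model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_val)
print(f"R^2 = {r2_score(y_val, pred):.2f}, MAE = {mean_absolute_error(y_val, pred):,.0f}")
```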