Recognizing Handwritten Digits with scikit-learn

 


Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents. Think, for example, of the ZIP codes on letters at the post office and the automation needed to recognize these five digits. Accurate recognition of these codes is necessary in order to sort mail automatically and efficiently. Another application that may come to mind is OCR (Optical Character Recognition) software. OCR software must read handwritten text, or pages of printed books, to produce general electronic documents in which each character is well defined.


Hypothesis :

The scikit-learn library provides numerous data sets that are useful for testing many problems of data analysis and prediction. One of them is the Digits data set. Some scientist claims that a model trained on it predicts the digit accurately 95% of the time. Perform data analysis to accept or reject this hypothesis.


Step 1: Dataset 

In this project, we are using the Handwritten Digits dataset, which is already available in the sklearn library. It can be loaded directly from the library.
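The loading step can be sketched as follows. This is a minimal example, assuming the dataset is loaded with load_digits from sklearn.datasets; the variable name digits is an illustrative choice.

```python
from sklearn.datasets import load_digits

# Load the handwritten digits dataset that ships with scikit-learn
# (no download is needed).
digits = load_digits()

# The returned object behaves like a dictionary; list what it contains.
print(digits.keys())
```

The object exposes, among other things, the flattened pixel values under data and the digit labels under target.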




We focus mainly on the data and the targets, and store each in its own variable.
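A short sketch of that extraction, assuming the conventional names X for the features and y for the labels (both names are assumptions, not taken from the original code):

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Feature matrix: one row per image, 64 pixel intensities (8x8, flattened).
X = digits.data
# Target vector: the digit (0-9) that each image represents.
y = digits.target

print(X.shape)  # (1797, 64)
print(y.shape)  # (1797,)
```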



Now we can see what our data looks like.
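One simple way to inspect the data is to print a single sample. The sketch below shows the first image as an 8x8 grid of pixel intensities together with its label; picking the first sample is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each image is stored as an 8x8 array of grayscale intensities.
first_image = digits.images[0]
first_label = digits.target[0]

print(first_image)
print("Label of the first image:", first_label)

# Pixel intensities are small non-negative numbers; show their range.
print("Pixel range:", digits.data.min(), "to", digits.data.max())
```

The same 8x8 array could also be passed to a plotting library to display the digit as an image.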


Step 2: Model Planning


1. Support Vector Classifier :





Here we use most of the data for training and very little for testing. The Support Vector Classifier does a very good job on the data, and we get 100% accuracy on the test data. With so few test samples, though, this score is only a coarse estimate.
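The setup described above can be sketched as follows. The exact split and hyperparameters of the original run are not stated, so the values here are assumptions: the first 1790 samples are used for training, the remaining 7 for testing, and gamma=0.001 with C=100 is one commonly used setting for this dataset.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# Assumed split: first 1790 samples for training, the remaining 7 for testing.
n_train = 1790
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Assumed hyperparameters; the values of the original run are not stated.
svc = SVC(gamma=0.001, C=100.0)
svc.fit(X_train, y_train)

svc_accuracy = accuracy_score(y_test, svc.predict(X_test))
print(f"SVC accuracy on {len(y_test)} test samples: {svc_accuracy:.2%}")
```

With a test set this small, each misclassified sample changes the score by roughly 14 percentage points, which is why a larger test set gives a more trustworthy figure.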


2. Decision Tree Classifier :




This time, we use different sizes of training and validation data. The Decision Tree classifier performs poorly on the data. We can increase the accuracy by fine-tuning the hyperparameters of the Decision Tree classifier.
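A hedged sketch of both steps, a baseline tree and a small hyperparameter search, is shown below. The 75/25 split, the random seed, and the parameter grid are all assumptions chosen for illustration; they are not the values from the original experiment.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Assumed split: 75% training, 25% validation, stratified by digit.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline tree with default hyperparameters.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_val, tree.predict(X_val))

# Small, illustrative grid; cross-validation runs on the training part only.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 10, 15],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_val, search.predict(X_val))

print(f"Baseline tree accuracy: {baseline_accuracy:.2%}")
print(f"Tuned tree accuracy:    {tuned_accuracy:.2%}")
print("Best parameters:", search.best_params_)
```

Tuning does not guarantee a higher validation score, so it is worth comparing both numbers rather than assuming the tuned tree is better.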

3. Random Forest Classifier :



Random Forest performs excellently with less training data than both the Decision Tree and the Support Vector Classifier. We get a 92% accuracy score with the Random Forest Classifier.
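The experiment can be sketched as follows, assuming the 1500-sample training set mentioned in the conclusion and testing on the remaining samples. The number of trees and the random seed are assumptions, so the score printed by this sketch need not match the figure reported above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

digits = load_digits()
X, y = digits.data, digits.target

# First 1500 samples for training, the remaining 297 for testing.
n_train = 1500
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Assumed settings: 100 trees and a fixed seed for repeatable results.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"Random Forest accuracy on {len(y_test)} test samples: {forest_accuracy:.2%}")
```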


Conclusion 

In line with our hypothesis, we can say that with hyperparameter tuning, different machine learning models, or more data, we can achieve close to 95% accuracy on the handwritten digits dataset. But we must also keep a good amount of test data; otherwise the accuracy estimate is unreliable and overfitting can go unnoticed. Data matters the most: we need a good amount of data for the model. If we have less data, we can use another machine learning classifier such as a random forest, which gives 92% accuracy with a training set of 1500 samples, less data than the Support Vector Classifier used.
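To compare the models against the 95% claim on equal terms, one option is to score all three on the same held-out split. The sketch below is only an illustration: the helper meets_claim, the 75/25 split, and all hyperparameters are hypothetical choices and not part of the original analysis.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

CLAIMED_ACCURACY = 0.95  # the accuracy stated in the hypothesis


def meets_claim(accuracy, threshold=CLAIMED_ACCURACY):
    """Return True if the measured accuracy reaches the claimed level."""
    return accuracy >= threshold


digits = load_digits()
# Assumed split: 75% training, 25% testing, shared by all models.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25,
    stratify=digits.target, random_state=0,
)

# Assumed hyperparameters, chosen only for illustration.
models = {
    "SVC": SVC(gamma=0.001, C=100.0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    verdict = "meets" if meets_claim(results[name]) else "falls short of"
    print(f"{name}: {results[name]:.2%} ({verdict} the 95% claim)")
```

Using one shared test split keeps the comparison fair, since each model is judged on exactly the same unseen samples.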


I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com
