Scalable Persian Business Data Extraction with Custom NER Pipeline
Project type
Scrapy, Named-Entity Recognition (NER), Machine Learning, Natural Language Processing, Hugging Face
I developed an end-to-end solution to scrape, process, and extract structured data from millions of public business records published in Persian. I identified the dataset's URL ranges with a binary search, then orchestrated a cluster of Scrapy spiders hosted on AWS to extract the data at scale.
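As a rough illustration of the URL-range step, the sketch below finds the highest valid record ID with an exponential probe followed by a binary search. It assumes record IDs are numeric and contiguous (every ID up to some maximum resolves, and none beyond it), which the project description does not state. The `exists` callable is a hypothetical stand-in for whatever check the real spiders performed, such as requesting a record URL and inspecting the response.

```python
from typing import Callable


def find_last_valid_id(exists: Callable[[int], bool], start: int = 1) -> int:
    """Return the largest ID for which exists(id) is True.

    Assumes exists() is monotonic: True for every ID from `start` up to
    some maximum, and False for every ID after it.
    """
    if not exists(start):
        raise ValueError("the starting ID must be valid")

    # Exponential probe: double the upper bound until it falls outside the range.
    lo = start
    hi = start * 2 if start > 0 else 1
    while exists(hi):
        lo = hi
        hi *= 2

    # Invariant: exists(lo) is True and exists(hi) is False.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical stand-in for an HTTP check against a record URL.
def fake_exists(record_id: int) -> bool:
    return record_id <= 1_234_567


last_id = find_last_valid_id(fake_exists)
```

With this approach the number of probes grows with the logarithm of the range size, so a range of millions of records needs only a few dozen requests to locate.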
To process the raw text, I designed a custom pipeline to generate a labeled NER training set, using Google Translation Services and ChatGPT 4 to bootstrap it. I then iteratively fine-tuned a pre-trained Persian NER model from Hugging Face to achieve high accuracy.
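One common way to turn bootstrapped annotations into NER training data is to convert character-offset entity spans into token-level BIO tags. The sketch below shows that conversion under stated assumptions: whitespace tokenization, spans given as `(start, end, label)` character offsets, and the hypothetical labels `ORG` and `LOC`. The project's actual label set, tokenizer, and annotation format are not described, and the English sample text is only for readability.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label); hypothetical format


def spans_to_bio(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Convert character-offset spans into whitespace tokens with BIO tags."""
    tokens: List[str] = []
    tags: List[str] = []
    pos = 0
    for token in text.split():
        # Locate the token in the original text to recover its offsets.
        start = text.index(token, pos)
        end = start + len(token)
        pos = end

        tag = "O"
        for span_start, span_end, label in spans:
            # A token is tagged only if it lies fully inside a span.
            if start >= span_start and end <= span_end:
                prefix = "B-" if start == span_start else "I-"
                tag = prefix + label
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


sample = "Acme Trading Co registered in Tehran"
org_start = sample.index("Acme")
org_end = sample.index("Co") + len("Co")
loc_start = sample.index("Tehran")
loc_end = loc_start + len("Tehran")

tokens, tags = spans_to_bio(
    sample, [(org_start, org_end, "ORG"), (loc_start, loc_end, "LOC")]
)
```

Token and tag lists in this shape can then be aligned to a model's subword tokenizer before fine-tuning.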
I deployed the final model for batch inference to extract the specified fields from the complete dataset, and delivered the structured output in JSON Lines format for downstream analysis.
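For reference, JSON Lines stores one JSON object per line, so large outputs can be streamed and processed record by record. The minimal sketch below writes and reads such a file with the standard library. The field names `company_name` and `registration_id` and the sample values are hypothetical, since the actual extracted fields are not listed. Passing `ensure_ascii=False` keeps Persian text readable in the output instead of escaping it.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO


def write_jsonl(records: Iterable[Dict], fp: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        # ensure_ascii=False writes non-ASCII characters as-is.
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(fp: TextIO) -> List[Dict]:
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Hypothetical records; real field names depend on the extraction schema.
records = [
    {"company_name": "شرکت نمونه", "registration_id": "12345"},
    {"company_name": "Example Co", "registration_id": "67890"},
]

buffer = io.StringIO()
written = write_jsonl(records, buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
```

When writing to disk instead of an in-memory buffer, opening the file with `encoding="utf-8"` avoids platform-dependent default encodings.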
This project demonstrates expertise in web scraping, AWS infrastructure, NLP model fine-tuning, and scalable data processing workflows.





