Since I published the Scrapy crawler code on GitHub for scraping app data from Google Play, I have periodically received emails from researchers and graduate students about the dataset. Based on the conversations I have had with them so far, I think it’s time to write another blog post and clarify some issues with the crawler and the dataset.
#1 The Scrapy script for the crawler can be found at my GitHub repository. You’re free to fork the script from there.
#3 As far as I remember, the Scrapy version I used to run this script was 0.26. The current version is probably 1.0.xx. However, no one has reported any issues so far.
#3 Python 2.7 was used to run Scrapy, and this hasn’t changed in the new Scrapy version.
#4 The data was saved in PostgreSQL, so anyone requesting the dataset should install PostgreSQL as well.
#5 If you find that the script is not scraping the intended fields from Google Play, you should look into the XPath definitions and adjust the script accordingly.
#6 I only ran the script on the Google Play U.S. store, and the data is a couple of months old. If you want new data, then running the script from your own laptop or a server might be a good idea. It took me just 3-4 days to finish crawling on an Ubuntu server with limited resources.
#7 If you are interested in the app review dataset, then there is actually no comprehensive review dataset. I only fetched review data for a few apps. In addition, the review crawler can only scrape a limited number of reviews from a page. Anyone interested in building a comprehensive dataset should extend the script so that it can crawl the reviews loaded through Ajax.
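To make point #5 a bit more concrete, here is a minimal sketch of one way to keep the XPath definitions in a single table so they are easy to adjust when the page markup changes. This is not the code from my repository: the field names, the class names, and the sample page are all made up for illustration, and the sketch uses Python's built-in ElementTree on a well-formed snippet as a stand-in for Scrapy's selectors. In the actual spider you would apply the same idea with `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical field -> candidate XPath table. The class names below are
# invented for this example and are NOT the real Google Play markup.
FIELD_XPATHS = {
    "title": [".//div[@class='document-title']", ".//h1[@class='app-title']"],
    "developer": [".//a[@class='document-subtitle']", ".//span[@class='dev-name']"],
}


def extract_fields(root, field_xpaths):
    """Return {field: text or None}, using the first XPath that yields text."""
    item = {}
    for field, candidates in field_xpaths.items():
        item[field] = None
        for xpath in candidates:
            node = root.find(xpath)
            if node is not None and node.text and node.text.strip():
                item[field] = node.text.strip()
                break
    return item


# Hypothetical, well-formed sample page used only to exercise the helper.
SAMPLE_PAGE = """
<html><body>
  <h1 class="app-title"> Example App </h1>
  <a class="document-subtitle">Example Dev</a>
</body></html>
""".strip()

item = extract_fields(ET.fromstring(SAMPLE_PAGE), FIELD_XPATHS)
```

With a table like this, a field that comes back as `None` tells you which XPath list needs a new candidate, and the fix stays in one place instead of being scattered through the parsing code.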
If you need the original dataset or need any further information, feel free to contact me at [email protected]