Google Summer of Code 2019: Improve Article Recommendation Pipeline
GSoC Project Proposal: Improve Article Recommendation Pipeline
Mentor: Bahodir Mansurov
Synopsis
The project improved the article recommendation pipeline by resolving several issues in the article-recommender projects. The issues resolved as part of the project are:
- Remove duplicate Wikidata items from article recommendations
- Recommendation API translation endpoint stopped working
- Article recommendation API: replace WDQS with MW API
- morelike recommendation API: Bulk import data to MySQL in chunks
Links to code changes
Merged:
- Remove duplicate Wikidata items from article recommendations.
- Throw appropriate error when wmAPI returns internal server error.
- Splits the request to MediaWiki API and Wikidata query service in batches.
- Replace Wikidata Query Service with MediaWiki API
- Bulk import data to MySQL in chunks
Outcome
Each task in the project had a noticeable outcome.
The first task made sure that the data returned by the morelike endpoint did not contain duplicates. As a result, a single call to the API now returns more distinct Wikidata items than it did previously.
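The idea behind the first task can be illustrated with a minimal sketch. It assumes each recommendation is a dictionary carrying a `wikidata_id` key; that field name and the function name are hypothetical and chosen only for illustration, not taken from the project's code.

```python
def dedupe_by_wikidata_id(recommendations):
    """Keep the first recommendation seen for each Wikidata ID, preserving order.

    Assumption: every item is a dict with a "wikidata_id" key (hypothetical name).
    """
    seen = set()
    unique = []
    for rec in recommendations:
        wikidata_id = rec["wikidata_id"]
        if wikidata_id in seen:
            # A recommendation for this Wikidata item was already kept.
            continue
        seen.add(wikidata_id)
        unique.append(rec)
    return unique
```

Because duplicates no longer take up slots in the result, the same number of returned rows can cover more distinct Wikidata items.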
The second task made sure that the translation endpoint no longer fails intermittently without returning a proper error code. This helps with debugging when an error arises. It also made sure that the API does not fail because of an error in the Wikidata Query Service.
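A rough sketch of this kind of error handling is shown below. It is not the project's implementation: the `UpstreamError` class, the function names, and the choice of status 502 are all assumptions made for illustration. The point is only that a failure in an upstream service is turned into an explicit error response instead of an unexplained failure.

```python
class UpstreamError(Exception):
    """Raised when an upstream service (hypothetical wrapper) reports a server error."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def check_upstream_response(status_code, body):
    """Return the body, or raise UpstreamError if the upstream call failed."""
    if status_code >= 500:
        # Assumption: report upstream failures to the client as 502 Bad Gateway.
        raise UpstreamError(502, f"upstream service failed with status {status_code}")
    return body


def handle_translation_request(fetch):
    """Call fetch() -> (status_code, body) and map failures to an error response."""
    try:
        status_code, body = fetch()
        return 200, check_upstream_response(status_code, body)
    except UpstreamError as err:
        # The client gets a proper error code and a message it can log.
        return err.status, {"error": str(err)}
```

For example, `handle_translation_request(lambda: (500, None))` returns a 502 status with an error message rather than raising an unhandled exception.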
The third task replaced the internal call to the Wikidata Query Service (WDQS) with a call to the MediaWiki API (MWAPI) when the translation endpoint is called. This decreased the number of requests made internally, which in turn reduced the time taken by the translation endpoint. Replacing the slower WDQS with the faster MWAPI reduced the response time further.
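To illustrate how such lookups can be batched, here is a minimal sketch that only builds the query parameters and makes no network calls. It assumes the Wikidata IDs are looked up through the MediaWiki API's `pageprops` property with `ppprop=wikibase_item`, and that batches hold 50 titles, the usual per-request limit for non-bot clients; whether the project used exactly these parameters is an assumption, and the function names are hypothetical.

```python
def batches(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_pageprops_queries(titles, batch_size=50):
    """Build one MediaWiki API query-parameter dict per batch of titles.

    Assumption: each batch asks for the `wikibase_item` page property, which
    holds the Wikidata ID linked to a page.
    """
    return [
        {
            "action": "query",
            "format": "json",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
            # The MediaWiki API accepts multiple titles separated by "|".
            "titles": "|".join(batch),
        }
        for batch in batches(titles, batch_size)
    ]
```

With 120 titles this produces three parameter sets (50, 50, and 20 titles) instead of one request per title.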
The fourth task made sure that the script used to import the data generated by Hadoop into the database does not block the CPU. This improves CPU efficiency on the shared machines and allows other CPU-bound tasks to run without being blocked.
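The chunked import can be sketched roughly as follows. To keep the example self-contained, Python's built-in `sqlite3` module stands in for MySQL, and the table name, column names, chunk size, and pause length are all hypothetical. The project's actual script may differ.

```python
import sqlite3
import time
from itertools import islice


def import_in_chunks(connection, rows, chunk_size=1000, pause_seconds=0.0):
    """Insert `rows` in chunks, committing and optionally pausing after each one.

    Assumptions: `connection` is a sqlite3 connection (a stand-in for MySQL) and
    a `recommendation` table with these three columns exists (hypothetical schema).
    Returns the number of rows inserted.
    """
    iterator = iter(rows)
    total = 0
    while True:
        # Pull at most `chunk_size` rows so the whole file is never held in memory.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        connection.executemany(
            "INSERT INTO recommendation (source_id, target_id, score) VALUES (?, ?, ?)",
            chunk,
        )
        connection.commit()
        total += len(chunk)
        # A short pause between chunks leaves room for other processes on the machine.
        time.sleep(pause_seconds)
    return total
```

Committing after each chunk keeps individual transactions small, and the pause between chunks is one simple way to avoid monopolizing a shared machine during a long import.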
Acknowledgement
I would like to thank my mentor, Bahodir Mansurov, for helping me throughout the program, encouraging me with great feedback, and guiding me towards project completion.
Thoughts about the project and more
This was the first time I was formally going to work on a “real” project. I have been contributing to open source for more than two years, but those have mostly been one-off contributions. This was also the first time I was writing code for an organization as big as Wikimedia. I was excited to work with the other developers. Everyone I interacted with from the Wikimedia team was helpful and more than willing to go out of their way to help a beginner like me. It felt wonderful working alongside my mentor and all the other people in the team. A big thank you to all the fellow developers who helped me with the project, and also to the Wikimedia Foundation and Google for giving me this wonderful opportunity.