PyGitGraph¶
In this blog post, I introduce PyGitGraph, an open-source tool I developed to work with GitHub Issues.
This automated tool lets you extract the full details of thousands of GitHub Issues using Python and GraphQL.
GitHub Issues¶
GitHub Issues are fantastic!
They enable you to track tasks, manage code defects, plan projects, and much more.
See below for some examples of how GitHub Issues help you manage complex tasks and projects.
Features of GitHub Issues
Track your tasks in a Kanban board.
Track your tasks in a tabular view.
Visualize your project roadmap in a Gantt chart.
Precious information¶
Over time, a lot of critical information about your project or product is captured within GitHub issues.
Wouldn't it be great if there were an easy way to extract all this information and put it neatly into a CSV or JSON file?
Such an extract would allow you to answer questions like:
- which issues are logged most frequently,
- how long it takes, on average, to close an issue,
- what sentiment the comments inside issues express, etc.
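As a rough illustration, questions like these take only a few lines of Python once an extract exists. The sketch below uses three made-up issue records; the field names (`labels`, `createdAt`, `closedAt`) are assumptions about what an extract might contain, not PyGitGraph's actual output format.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical sample records; a real extract would hold many more rows.
issues = [
    {"number": 1, "labels": ["bug"],
     "createdAt": "2023-01-02T09:00:00Z", "closedAt": "2023-01-04T09:00:00Z"},
    {"number": 2, "labels": ["bug", "ui"],
     "createdAt": "2023-01-03T10:00:00Z", "closedAt": "2023-01-03T22:00:00Z"},
    {"number": 3, "labels": ["docs"],
     "createdAt": "2023-01-05T08:00:00Z", "closedAt": None},  # still open
]


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2023-01-02T09:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def label_counts(records):
    """Count how many issues carry each label."""
    return Counter(label for record in records for label in record["labels"])


def average_days_to_close(records):
    """Average open-to-close duration in days, ignoring issues still open."""
    durations = [
        (parse_ts(r["closedAt"]) - parse_ts(r["createdAt"])).total_seconds() / 86400
        for r in records
        if r["closedAt"]
    ]
    return mean(durations) if durations else None


print(label_counts(issues).most_common(1))
print(average_days_to_close(issues))
```

The same two functions would work unchanged on a list loaded from a JSON extract, provided the records use the field names assumed here.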
We could go ahead and build graphs like the ones below.
Graphs of GitHub Issue details
A heatmap showing the number of issues opened each day.
A line chart showing the number of issues created and closed in January 2023.
A bar chart showing the number of issues logged for each label.
A bar chart displaying the average time taken to close issues for each label.
A word cloud highlighting words from the titles of GitHub Issues.
Extraction challenges¶
Building graphs or training machine learning models requires a large amount of data. However, extracting substantial amounts of data about GitHub Issues is not easy.
GitHub provides two options for data extraction: the REST API and the GraphQL API.
GraphQL has the advantage over the REST API for the following reasons:
- Precise data retrieval - With GraphQL, you receive exactly the fields you request, and the response mirrors the shape of your query. In contrast, a REST endpoint returns a fixed response format that often includes more data than you need.
- Efficiency of retrieval - GraphQL can often return all the data you need in a single request, whereas the REST API typically needs multiple requests for the same data. Fewer round trips to the server make your application faster and more efficient.
There are additional benefits of using GraphQL over the REST API, detailed by GitHub here.
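To make the "precise data retrieval" point concrete, here is a minimal sketch of a GraphQL request that asks for only a handful of issue fields. The query text and the helper names `build_payload` and `build_request` are illustrative assumptions, not code taken from PyGitGraph; the repository owner and name are placeholders. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Only the fields listed under `nodes` come back in the response.
ISSUES_QUERY = """
query($owner: String!, $name: String!, $pageSize: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $pageSize, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        title
        state
        createdAt
        closedAt
        labels(first: 10) { nodes { name } }
      }
    }
  }
}
"""


def build_payload(owner, name, page_size=100, cursor=None):
    """Build the JSON body for one page of issues (hypothetical helper)."""
    return {
        "query": ISSUES_QUERY,
        "variables": {
            "owner": owner,
            "name": name,
            "pageSize": page_size,
            "cursor": cursor,
        },
    }


def build_request(payload, token):
    """Wrap the payload in a POST request carrying the access token."""
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder repository and token, for illustration only.
payload = build_payload("octocat", "Hello-World")
request = build_request(payload, token="YOUR_TOKEN")
```

Passing `request` to `urllib.request.urlopen` would send it. Adding or removing a field in the `nodes` block changes the response accordingly, which is the flexibility the REST API's fixed response format lacks.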
In summary, the GraphQL API provides an efficient way to extract information from GitHub.
However, tools integrating with the GitHub GraphQL API are scarce.
This gap inspired me to develop a suite of Jupyter notebooks designed to help you extract, create, close, and delete GitHub issues using GraphQL.
Features of PyGitGraph¶
In a nutshell, these are the features of PyGitGraph:
- Manage GitHub Issues
  - Create issues from Jupyter notebooks
  - Include labels, body, assignee, etc.
  - Full Markdown support (lists, checkboxes, etc.)
  - Work with private/public repositories
- Analyze historical trends
  - Download thousands of GitHub issues
  - Paginate 100 records at a time
  - Flatten nested structures (JSON to CSV)
  - Create historical data extracts
  - Build visualizations using bulk extracts
  - Visualize trends, build heatmaps
  - Issue closure rates, issue duration, etc.
  - Use tools of your choice
- Easy, but secure
  - Also works with GitHub Enterprise
  - Run any GraphQL query you want
  - Customize all query parameters
  - Support for Python venv and VS Code Dev Containers
  - GitHub tokens are stored locally
  - Work with fine-grained tokens
  - Understand rate and node limits
  - Check query usage
- Many use cases
  - Issue trend analysis and forecasting
  - Contributor activity and performance metrics
  - Label-based issue categorization and prioritization
  - Sentiment analysis with NLP models
  - Build forecasting models
  - When most issues are logged
  - Issue closure rates
  - How long issues stay open
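Two of the features above, paginating 100 records at a time and flattening nested JSON into CSV, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not PyGitGraph's actual implementation: `run_query` is a hypothetical callable that takes a page size and a cursor and returns one page shaped like a GraphQL connection (`nodes` plus `pageInfo`), and `fetch_all_issues`, `flatten`, and `to_csv` are names invented for this example.

```python
import csv
import io


def fetch_all_issues(run_query, page_size=100):
    """Collect all nodes by following the cursor from page to page.

    run_query(page_size, cursor) is a hypothetical callable returning a dict
    with "nodes" and "pageInfo" ({"hasNextPage": ..., "endCursor": ...}).
    """
    nodes, cursor = [], None
    while True:
        page = run_query(page_size, cursor)
        nodes.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = page["pageInfo"]["endCursor"]


def flatten(record, prefix=""):
    """Turn nested dicts into dotted column names; join lists with ';'."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Assumes lists of simple values such as label names.
            flat[column] = ";".join(str(item) for item in value)
        else:
            flat[column] = value
    return flat


def to_csv(records):
    """Flatten each record and return the result as CSV text."""
    rows = [flatten(record) for record in records]
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the network call behind `run_query` means the pagination loop can be exercised with canned pages before any token is involved. Lists of nested objects would need extra handling that this sketch leaves out.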
Your thoughts¶
I would like to know your thoughts on this tool. Connect with me on LinkedIn or drop me an email at [email protected].
The entire code is MIT licensed. Find it here.