class: middle

## Global Software Engineering in Open Source and its Effect on Quality @ [WSSE22](https://cs.hse.ru/wsse/)

Helge Pfeiffer, Assistant Professor,
[Research Center for Government IT](https://www.itu.dk/forskning/institutter/institut-for-datalogi/forskningscenter-for-offentlig-it),
[IT University of Copenhagen, Denmark](https://www.itu.dk)
`ropf@itu.dk`

---

# Who am I?

* Dipl-Inf. in Software Engineering from Friedrich-Schiller Universität Jena
* PhD in Software Engineering from ITU
* Software engineer at the Danish Meteorological Institute
* Lecturer at Cphbusiness
* Since Jan. 2019 Assist. Prof. at ITU, working on Software Quality, Software Quality Metrics, and Technical Debt

---

# What are we doing today?

* Discussion of the paper that you read
* Brief recap of branching strategies
* Mini research project in groups - we investigate the key result of the paper

---

## Learning Objectives

After this session the student will be able to:

* Understand how VCS histories and issue tracker data are collected for research.
* Analyze the history from Git repositories and the Jira issue tracker.
* Apply scripts and programs in various languages to clean and pre-process the exported data.
* Create scripts and programs to analyze Git VCS and Jira issue tracker data to investigate certain research questions.
* Interpret analysis results to either better understand current practices or to suggest actionable changes to current practices in software engineering.

---

## You all read the paper:

* E. Shihab et al. ["The Effect of Branching Strategies on Software Quality"](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/shihab-esem-2012.pdf)

---

## Discussion

#### What did you take away from the paper?

* ...
* ...

---

## Discussion

#### Which research method is used in the paper?

* ...
* ...

---

## Discussion

#### What are the key results?

* ...
* ...

---

## Discussion

#### Observations

* ...
* ...

---

## Branching in Git?

--

### What is a Branch?

* Git stores its data as a series of snapshots.

---

### What is a Branch?

* Amongst others, commit objects contain a pointer to zero or more direct parents of this commit:
  * First commit: _zero_ parents
  * Normal commit: _one_ parent
  * Merging commit: _multiple_ parents

See, for example, Chapter 3 "Git Branching" and Chapter 10 "Git Internals" in the Pro Git book.

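---

### What is a Branch?

The parent counts above can be read from the output of `git rev-list --parents`, which prints each commit hash followed by the hashes of its parents. A minimal sketch, assuming that output format; the function name `classify_commit` and the shortened sample hashes are made up for illustration:

```python
def classify_commit(rev_list_line):
    """Classify one line of `git rev-list --parents` output.

    Each line holds the commit hash followed by its parent hashes.
    """
    commit_hash, *parents = rev_list_line.split()
    if len(parents) == 0:
        kind = "first commit"
    elif len(parents) == 1:
        kind = "normal commit"
    else:
        kind = "merge commit"
    return commit_hash, kind


# Hypothetical, shortened hashes for illustration only.
sample_output = """\
c3d4 a1b2 b2c3
b2c3 a1b2
a1b2
"""
for line in sample_output.splitlines():
    print(classify_commit(line))
```
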
---

### What is a Branch?

A _branch_ in Git is simply a pointer to a commit. The default branch name is `master`. Every time you commit, the pointer moves forward automatically.

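---

### What is a Branch?

To make the pointer idea concrete, here is a toy model in Python. It is not how Git is implemented; the names `commits`, `branches`, `head`, and `commit()` are made up for illustration:

```python
# Toy model: commits map to their parent list, branches map to a commit.
commits = {"C0": []}
branches = {"master": "C0"}
head = "master"  # name of the branch that is currently checked out


def commit(new_id):
    """Create a commit on the current branch and move its pointer forward."""
    commits[new_id] = [branches[head]]
    branches[head] = new_id


commit("C1")
branches["testing"] = branches["master"]  # similar to `git branch testing`
commit("C2")  # only `master` moves, `testing` still points at C1
print(branches)
```

Creating the `testing` branch only adds one more name for an existing commit, which is why branching itself is cheap.
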
---

### Branching Strategies?

> “We are a team of four senior developers (by which I mean we’re all over 40 with 20+ years each of development experience) and not one of us has had a positive experience in the past with branching the mainline... The branch is easy - it’s the merge at the end that’s painful”.
>
> [Phillips et al. _"Branching and Merging: An Investigation into Current Version Control Practices"_](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.278.6065&rep=rep1&type=pdf)

---

#### Long-Running Branches/Branch by Release

![](http://git-scm.com/figures/18333fig0318-tn.png)

![](http://git-scm.com/figures/18333fig0319-tn.png)

---

#### Topic Branches/Branch by Purpose

A topic branch is a short-lived branch that you create and use for a single particular feature or related work.

![](http://git-scm.com/figures/18333fig0320-tn.png)

---

#### Topic Branches/Branch by Purpose

After merging `iss91v2` and `dumbidea` and throwing away the original `iss91` branch (losing commits C5 and C6), the repository's history looks as follows:

---

#### Git-Flow

Read about it here: http://nvie.com/posts/a-successful-git-branching-model/

---

#### No Branches/Trunk-based Development

> Almost all development occurs at the “head” of the repository, not on branches. This helps identify integration problems early and minimizes the amount of merging work needed. It also makes it much easier and faster to push out security fixes.
>
> Henderson [_"Software Engineering at Google"_](https://arxiv.org/pdf/1702.01715.pdf)

--

##### Release Branches

---

##### No Branches

> A release typically starts in a fresh workspace, by syncing to the change number of the latest “green” build (i.e. the last change for which all the automatic tests passed), and making a release branch. The release engineer can select additional changes to be “cherry-picked”, i.e. merged from the main branch onto the release branch. Then the software will be rebuilt from scratch and the tests are run. If any tests fail, additional changes are made to fix the failures and those additional changes are cherry-picked onto the release branch, after which the software will be rebuilt and the tests rerun. When the tests all pass, the built executable(s) and data file(s) are packaged up. All of these steps are automated so that the release engineer need only run some simple commands, or even just select some entries on a menu-driven UI, and choose which changes (if any) to cherry pick.
>
> Henderson [_"Software Engineering at Google"_](https://arxiv.org/pdf/1702.01715.pdf)

---

## Which Branching Model does Microsoft Use?

In the paper [_"The Effect of Branching Strategies on Software Quality"_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/shihab-esem-2012.pdf) the authors describe how Microsoft used branches when building Windows 7 and Windows Vista. Which branching model is it?

* ...
* ...

---

## Let's do some research!

![](https://thumbs.gfycat.com/WhirlwindGrimyAztecant-max-1mb.gif)

---

## Relation between branching and SW quality?

> We find that, indeed, branching does have an effect on software quality...
>
> E. Shihab et al. ["The Effect of Branching Strategies on Software Quality"](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/shihab-esem-2012.pdf)

* Is that true for open-source projects too?
* That is, for software that is developed in a globally distributed manner, can we find an effect of branching on defects?

---

## Research Question

#### Is there a correlation between branching activity and software quality?

--

* _Branching activity_: Let's consider it as the number of merge commits. Merging is what Shihab et al. and Phillips et al. describe as the real issue.
* _Software quality_: Let's consider the number of reported defects as a proxy for software quality. That is, we consider tickets that are labeled as bugs in an issue tracker as defect reports.

---

## What do we need to investigate the RQ?

- a) ...
- b) ...
- c) ...
- d) ...

---

## What do we need to investigate the RQ?

### a) Some case systems

I suggest that we have a look at some systems from the Apache Software Foundation, e.g.,

* [Apache Airflow](https://github.com/apache/airflow)
* [Apache Camel](https://github.com/apache/camel)
* [Apache Ignite](https://github.com/apache/ignite)
* ...

Find a complete list of systems [here](https://github.com/HelgeCPH/ase-effect-branching-on-defects#groups).

---

## What do we need to investigate the RQ?

### b) Defect Reports

Most Apache projects use JIRA as their issue tracker. Let's use tickets labeled as _bug_ as a proxy for software quality.

* [Apache Airflow JIRA](https://issues.apache.org/jira/projects/AIRFLOW/issues)
* [Apache Camel JIRA](https://issues.apache.org/jira/projects/CAMEL/issues)
* [Apache Ignite JIRA](https://issues.apache.org/jira/projects/IGNITE/issues)

Let's count the **number of defect reports per week**.

---

## What do we need to investigate the RQ?

### c) History with merge commits of each system

Most Apache projects are hosted or mirrored on GitHub.

* [Apache Airflow](https://github.com/apache/airflow)
* [Apache Camel](https://github.com/apache/camel)
* [Apache Ignite](https://github.com/apache/ignite)

Let's count the **number of merge commits per week**.

---

## What do we need to investigate the RQ?

### d) Ability to compute correlation statistics and plotting

Is there a correlation between the **number of defect reports per week** and the **number of merge commits per week**?
How do the **number of defect reports per week** and the **number of merge commits per week** look when plotted in a scatter plot?

Spearman's ρ is likely a good correlation coefficient to compute.

---

## What do we need to investigate the RQ?

- a) Some case systems
- b) Defect Reports
- c) History with merge commits of each system
- d) Ability to compute correlation statistics and plotting

---

## What do we need to investigate the RQ?

---

### b) Defect Reports???

From the JIRA API via `curl` in the terminal.

```bash
# The issue type is filtered in the JQL query itself (issuetype=Bug).
curl "https://issues.apache.org/jira/"\
"rest/api/2/search?jql=project=AIRFLOW+AND+issuetype=Bug+order+by+created"\
"&fields=id,key,priority,labels,versions,"\
"status,components,creator,reporter,issuetype,description,"\
"summary,resolutiondate,created,updated"
```

See JIRA API docs:

* https://developer.atlassian.com/server/jira/platform/jira-apis-32344053/
* https://docs.atlassian.com/DAC/rest/jira/6.1.html

---

### b) Defect Reports???

Via a tool ([`requests` documentation](https://docs.python-requests.org/en/latest/)) that wraps HTTP requests nicely:

```python
import requests

jira_api_base_url = "https://issues.apache.org/jira/"
page_length = 500
project = "airflow"
proj_jira_id = project.upper()
start_idx = 0

# Restrict the search to bug reports in the JQL query itself.
url = (
    jira_api_base_url
    + f"rest/api/2/search?jql=project={proj_jira_id}+AND+issuetype=Bug"
    + "+order+by+created"
    + f"&maxResults={page_length}&startAt={start_idx}"
    + "&fields=id,key,priority,labels,versions,"
    + "status,components,creator,reporter,issuetype,description,"
    + "summary,resolutiondate,created,updated"
)
print(f"Getting data from {url}...")
r = requests.get(url)
r.raise_for_status()  # fail early on HTTP errors
r_dict = r.json()
```

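---

### b) Defect Reports???

A single search request returns only one page of results. Each response carries the fields `total` and `issues`, so all pages can be collected in a loop. A minimal sketch of that loop, assuming a hypothetical callable `get_page(start_at, max_results)` that performs one request and returns the decoded JSON; `fake_get_page` and the `DEMO-…` keys are made up so that the example runs without network access:

```python
def fetch_all_issues(get_page, page_length=500):
    """Collect all issues by requesting page after page.

    `get_page(start_at, max_results)` is a hypothetical callable that
    returns the decoded JSON of one search request as a dict.
    """
    issues = []
    start_at = 0
    while True:
        page = get_page(start_at, page_length)
        issues.extend(page["issues"])
        # Advance by the number of issues actually returned, since the
        # server may cap `maxResults` below the requested page length.
        start_at += len(page["issues"])
        if not page["issues"] or start_at >= page["total"]:
            return issues


def fake_get_page(start_at, max_results):
    """Stand-in for the HTTP request; serves 7 made-up issue keys."""
    all_issues = [{"key": f"DEMO-{i}"} for i in range(1, 8)]
    return {
        "total": len(all_issues),
        "issues": all_issues[start_at:start_at + max_results],
    }


all_issues = fetch_all_issues(fake_get_page, page_length=3)
print(len(all_issues))
```

With `requests`, `get_page` could build the URL with the given `startAt` and `maxResults` values and return `requests.get(url).json()`.
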
---

### c) History with merge commits of each system???

Using `git` directly.

```bash
git clone git@github.com:apache/airflow.git
git -C ./airflow log --graph --oneline --all
```

You may want to use [this lecture material on analyzing `git log` output](https://github.com/HelgeCPH/2019_ase_behavioural_analysis).

---

### c) History with merge commits of each system???

Via a tool ([PyDriller documentation](https://pydriller.readthedocs.io/en/latest/)) that wraps `git` nicely:

```python
import os

from pydriller import Repository

repo_name = "airflow"
gh_repo = f"https://github.com/apache/{repo_name}.git"
repo_path = os.path.join("data", "repositories", repo_name)
# PyDriller expects the target directory for the clone to exist.
os.makedirs(repo_path, exist_ok=True)

repo = Repository(path_to_repo=gh_repo, clone_repo_to=repo_path)
for commit in repo.traverse_commits():
    print(f"{commit.hash[:10]}")
```

---

### d) Ability to compute correlation statistics and plotting???

Using a plotting tool like `matplotlib` ([official documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html?highlight=scatter#matplotlib.pyplot.scatter)) directly.

```python
import matplotlib.pyplot as plt

x_data = [2, 6, 12, 1, 23, 10, 8]  # merges per week
y_data = [5, 9, 3, 7, 11, 4, 7]  # defects per week

# Plot
plt.scatter(x_data, y_data)
plt.title("Project name")
plt.xlabel("merges per week")
plt.ylabel("defects per week")
plt.savefig("plot.png")
```

---

### d) Ability to compute correlation statistics and plotting???

Spearman's ρ is likely a good correlation coefficient to compute:

```python
from scipy import stats

x_data = [2, 6, 12, 1, 23, 10, 8]  # merges per week
y_data = [5, 9, 3, 7, 11, 4, 7]  # defects per week

ρ, p = stats.spearmanr(x_data, y_data)
print(ρ, p)
```

```
-0.01801874925391118 0.9694153868031987
```

See the documentation of `scipy`'s implementation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

---

### d) Ability to compute correlation statistics and plotting???

Via a tool ([`pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html)) that is meant for data analysis:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Week labels such as 012020 are kept as strings, they are only labels here.
df = pd.read_csv("frequencies_per_week.csv", dtype={"created_week": str})
# Correlate only the two count columns.
corr_index = df["merges_per_week"].corr(df["bugs_per_week"], method="spearman")
title = f"Project name (ρ={corr_index:.3f})"
df.plot(x="merges_per_week", y="bugs_per_week", kind="scatter", title=title)
plt.savefig("plot.png")
```

---

### Assumption

The previous example assumes that we have a CSV file `frequencies_per_week.csv` with contents like:

```
created_week,merges_per_week,bugs_per_week
012020,2,5
022020,6,9
032020,12,3
042020,1,7
052020,23,11
062020,10,4
072020,8,7
```

---

### What do we need to investigate the research question?

--

Find example data and code fragments in the repository of this session: https://github.com/HelgeCPH/ase-effect-branching-on-defects

---
class: middle

# What I would like you to do now...

---

## You work together in groups of three

---

## Each group works on one case system

---
class: middle

---

### One of you converts JIRA tickets into a CSV file

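---

### One of you converts JIRA tickets into a CSV file

One possible shape for this step, as a sketch under assumptions: it expects the list of issue dicts from the search response, keeps only those whose issue type is `Bug`, and writes one row per ticket. The function name `issues_to_csv`, the column names `key` and `created`, and the `DEMO-…` sample tickets are made up for illustration:

```python
import csv
import io
from datetime import datetime


def issues_to_csv(issues, out_file):
    """Write key and creation date of all bug tickets as CSV rows."""
    writer = csv.writer(out_file)
    writer.writerow(["key", "created"])
    for issue in issues:
        fields = issue["fields"]
        if fields["issuetype"]["name"] != "Bug":
            continue
        # Timestamps are assumed to look like 2020-01-06T10:15:30.000+0000
        created = datetime.strptime(fields["created"], "%Y-%m-%dT%H:%M:%S.%f%z")
        writer.writerow([issue["key"], created.date().isoformat()])


# Made-up tickets in the shape of the search response.
sample_issues = [
    {"key": "DEMO-1", "fields": {"issuetype": {"name": "Bug"},
                                 "created": "2020-01-06T10:15:30.000+0000"}},
    {"key": "DEMO-2", "fields": {"issuetype": {"name": "Improvement"},
                                 "created": "2020-01-07T08:00:00.000+0000"}},
]
buffer = io.StringIO()
issues_to_csv(sample_issues, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("bugs.csv", "w", newline="")`, where `bugs.csv` is again just an example name.
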
---

### One of you converts Git merge commits into a CSV file

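---

### One of you converts Git merge commits into a CSV file

A sketch of this step that uses `git log --merges` with the format placeholders `%H` (commit hash) and `%cI` (committer date in ISO 8601). The function names, the column names `hash` and `committed`, and the sample log lines are assumptions made for illustration; without `--all`, only merge commits reachable from the checked-out branch are listed:

```python
import csv
import io
import subprocess


def merge_commit_log(repo_path):
    """Return `<hash> <ISO date>` lines for the merge commits of a repository."""
    cmd = ["git", "-C", repo_path, "log", "--merges", "--format=%H %cI"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def merges_to_csv(log_output, out_file):
    """Write hash and commit day of each merge commit as CSV rows."""
    writer = csv.writer(out_file)
    writer.writerow(["hash", "committed"])
    for line in log_output.splitlines():
        commit_hash, committed = line.split()
        writer.writerow([commit_hash, committed[:10]])  # keep YYYY-MM-DD only


# Made-up log lines in the format produced by `merge_commit_log`.
sample_log = """\
abc123 2020-01-06T10:00:00+01:00
def456 2020-01-14T16:30:00+00:00
"""
buffer = io.StringIO()
merges_to_csv(sample_log, buffer)
print(buffer.getvalue())
```

For a real repository, the input would come from a call such as `merge_commit_log("./airflow")`.
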
---

### One of you combines the CSV files into weekly statistics

---

### ...plots them and computes the correlation coefficient

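---

### One of you combines the CSV files into weekly statistics

A sketch of the combination step, assuming that both inputs have already been reduced to lists of `YYYY-MM-DD` day strings. It counts events per ISO week and writes the columns used in `frequencies_per_week.csv`. The function names and the sample dates are made up for illustration:

```python
import csv
import io
from collections import Counter
from datetime import date


def week_of(day):
    """Map a `YYYY-MM-DD` string to its ISO (year, week) pair."""
    year, week, _ = date.fromisoformat(day).isocalendar()
    return year, week


def weekly_frequencies(merge_days, bug_days, out_file):
    """Write merges and bugs per week as CSV rows, ordered by week."""
    merges = Counter(week_of(d) for d in merge_days)
    bugs = Counter(week_of(d) for d in bug_days)
    writer = csv.writer(out_file)
    writer.writerow(["created_week", "merges_per_week", "bugs_per_week"])
    for year, week in sorted(set(merges) | set(bugs)):
        label = f"{week:02d}{year}"  # e.g. 022020 for week 2 of 2020
        writer.writerow([label, merges[(year, week)], bugs[(year, week)]])


# Made-up dates for illustration.
merge_days = ["2020-01-06", "2020-01-07", "2020-01-14"]
bug_days = ["2020-01-08", "2020-01-20"]
buffer = io.StringIO()
weekly_frequencies(merge_days, bug_days, buffer)
print(buffer.getvalue())
```

Note that weeks with neither a merge nor a bug report do not appear in the output at all. Whether such weeks should be added as rows with two zeros is a design decision that influences the correlation coefficient.
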
---

### Nothing is working yet?

---

### Nothing is working yet?

Don't despair!

![](https://c.tenor.com/yxezf5RJR3oAAAAC/squirrel-scared.gif)

---
class: middle

---

### Working in groups on the same tasks

---

### Solving the same tasks in groups

---
class: middle

---

### Back to original groups, combine solutions, and create results

---
class: middle

---

## Interpretation and Discussion of Results

- Do our results support the findings of Shihab et al. or do they diverge from their findings?

--

- What is the meaning of our results?

--

- How trustworthy are they?

---
class: middle

# Group work, now! -> Have fun!

* Find example data and code fragments in the repository of this session: https://github.com/HelgeCPH/ase-effect-branching-on-defects
* In case you need a shared editor, you might create one for this session on https://demo.firepad.io
* I will try to join some of the breakout rooms. In case you need help, write in the chat which room I should join.
* After round 3, please share the plots from your group via the chat.

---

## Results

---

## Interpretation and Discussion of Results

- Do our results support the findings of Shihab et al. or do they diverge from their findings?
- What is the meaning of our results?
- How trustworthy are they?

---

## Writing up the Results

If you feel like executing this project properly and writing it up as a short paper of conference quality, make use of the remaining time of the winter school and feel free to contact me.

If you are going to write your thesis eventually and are interested in this or related topics, please feel free to contact me too.

---
class: middle

## Global Software Engineering in Open Source and its Effect on Quality @ [WSSE22](https://cs.hse.ru/wsse/)

### Thank you for having me and for working together.

Helge Pfeiffer, Assistant Professor,
[Research Center for Government IT](https://www.itu.dk/forskning/institutter/institut-for-datalogi/forskningscenter-for-offentlig-it),
[IT University of Copenhagen, Denmark](https://www.itu.dk)
`ropf@itu.dk`