Scraping FRC team's GitHub accounts to gather large amounts of data

There are a lot of teams...

I was curious about the most used languages for FRC, so I build a Python script to find out what they where.

Some basic data

Before we get to the heavy work done by my script, let’s start with some general data.

Thanks to the TBA API, I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.

How the script works

The script is split into steps:

Get a list of every registered team
Check for a github account attached to every registered team
- If a team has an account, it is added to the dataset
Load each github profile
- If it is a private account, move on
- Use Regex to find all languages used
Compile data and sort

Getting a list of accounts

This is probably the simplest step in the whole process. I used the auto-generated tbaapiv3client python library’s get_teams_keys(key) function, and kept incrementing key until I got an empty array. All returned data was then added together into a big list of team keys.

Checking for a team’s github account

The TBA API helpfully provides a /api/v3/team/<number>/social_media API endpoint that will give the GitHub username for any team you request. (or nothing if they don’t use github)

A for loop on this with a list of every team number did the trick for finding accounts.

Fetching language info

To remove the need for an Oauth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete.

To check for language usage, a simple Regex pattern can be used: /programmingLanguage"\>(.*)\</gm

When combined with an re.findall(), this pattern will return a list of all recent languages used by a team.

Data saves / backup solution

To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for “savestates”.

On launch of the script, it will check for a ./data.json file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both all the saved data, and some counters.

Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script as, I needed to stop and start it every time I needed to implement a new feature.

All parsing data is saved to the json file every time the script completes, or it detects a SIGKILL.

What I learned

After letting the script run for about an hour, I got a bunch of data from every registered team.

This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO, would have C# in their list of languages. Things like that.

Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects where all mostly, or entirely written in Java. Next up, we have C++ with 725 projects, and Python with 468 projects.

After Java, C++, and Python, we start running in to languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:

C (128)
LabView (153)
Kotlin (96)
Rust (4)

Now, the rest of the languages below Python:

occurrences of JavaScript
occurrences of LabVIEW
occurrences of C
occurrences of Kotlin
occurrences of Arduino
occurrences of C#
occurrences of CSS
occurrences of PHP
occurrences of Shell
occurrences of Ruby
occurrences of Swift
occurrences of Jupyter Notebook
occurrences of Scala
occurrences of D
occurrences of TypeScript
occurrences of Dart
occurrences of Processing
occurrences of CoffeeScript
occurrences of Go
occurrences of Groovy
occurrences of Objective-C
occurrences of Rust
occurrences of MATLAB
occurrences of R
occurrences of Visual Basic
occurrences of Clojure
occurrences of Cuda

I have removed markup and shell languages from that list because most of them are probably auto-generated.

In terms of github account names, 133 teams follow FRC convention and use a username starting with frc, followed by their team number, 95 teams use team then their number, and 264 teams use something else.

Using the script

This script is not on PYPI this time. You can obtain a copy from my GitHub repo: https://github.com/Ewpratten/frc-code-stats

First, make sure both python3.7 and python3-pip are installed on your computer. Next, delete the data.json file. Then, install the requirements with pip3 install -r requirements.txt. Finally, run with python3 main.py to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!.