Parse the requirements file

The previous notebook, 'PyPi_Metadata.ipynb', parsed the requirements out of every package on the PyPI server. The output was a file that looks like this:

packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy
matplotlib
numpy

packages/astroid-1.4.4

packages/astroimtools-0.1
git+http://github.com/astropy/astropy.git#egg=astropy
astropy-helpers
cython>=0.23.4
distribute==0.0
matplotlib
numpy

Each block starts with the name of the Python package, followed by the dependencies I was able to parse. Many packages appear to have no dependencies; for now I will take that at face value, even though I know it is not always correct. Any package that defines its requirements programmatically in setup.py, and has no requirements file, is missed.
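A requirement line can carry a version specifier (e.g. `cython>=0.23.4`). The notebook itself uses the requirements-parser package for this; as a standard-library-only sketch of the same split, here is a hypothetical helper (`split_requirement` is not part of the notebook, and the regex is deliberately naive):

```python
import re

def split_requirement(line):
    # Split a line like 'cython>=0.23.4' into the package name and
    # whatever version specifier follows it (empty string if none).
    m = re.match(r'^([A-Za-z0-9._-]+)\s*(.*)$', line.strip())
    return (m.group(1), m.group(2)) if m else (None, None)

print(split_requirement('cython>=0.23.4'))   # ('cython', '>=0.23.4')
print(split_requirement('matplotlib'))       # ('matplotlib', '')
```

Lines like the `git+http://...` URL above would not split cleanly this way, which is one reason to prefer a real parser.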

The purpose of this notebook is largely just to parse that output file into a pandas dataframe.

In [ ]:
import pandas as pd
from collections import defaultdict
import os
import numpy as np
import requirements
import xmlrpclib

# I need this to separate the package name from its version.
# Note: xmlrpclib is Python 2; on Python 3 the module is xmlrpc.client.
client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
packages = client.list_packages()

1: Parse the requirements for each package

In [ ]:
datadict = defaultdict(list)
with open('requirements.txt', 'r') as infile:
    new_package = True
    package_name = None  # guards against a blank line before any package
    for line in infile:
        if line.strip() == '':
            new_package = True
            # A blank line ends the current block; if the package had no
            # parsed requirements, record it with a NaN requirement.
            if package_name is not None and package_name not in datadict['package']:
                datadict['package'].append(package_name)
                datadict['requirement'].append(np.nan)
            continue
        
        if new_package:
            # If this is the case, the current line gives the name of the package
            package_name = os.path.basename(line).strip()
            new_package = False
        else:
            # This line gives a requirement for the current package
            try:
                for req in requirements.parse(line.strip()):
                    datadict['package'].append(package_name)
                    datadict['requirement'].append(req.name)
            except ValueError:
                pass
                

# Convert to dataframe
df = pd.DataFrame(data=datadict)
df.head()
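As a sanity check of the blank-line state machine above, here is the same logic replayed on a slice of the sample file from the introduction (io.StringIO stands in for the real file, requirement lines are kept verbatim instead of going through requirements.parse, and None stands in for np.nan):

```python
import io
import os
from collections import defaultdict

sample = """packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy
"""

datadict = defaultdict(list)
package_name = None
new_package = True
for line in io.StringIO(sample):
    if line.strip() == '':
        # Blank line: the next non-blank line names a new package.
        new_package = True
        if package_name is not None and package_name not in datadict['package']:
            datadict['package'].append(package_name)
            datadict['requirement'].append(None)  # no parsed requirements
        continue
    if new_package:
        package_name = os.path.basename(line).strip()
        new_package = False
    else:
        datadict['package'].append(package_name)
        datadict['requirement'].append(line.strip())

print(dict(datadict))
# {'package': ['astrodbkit-0.2.0', 'astrodendro-0.1.0', 'astrodendro-0.1.0'],
#  'requirement': [None, 'aplpy', 'astropy']}
```

Note that astrodbkit gets a single placeholder row, while astrodendro gets one row per requirement.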

2: Get the base package name from the package string

The package column of the dataframe currently contains the name of the package as well as the version string. I need to separate the two. For that, I will use the package list from pypi itself again.
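The round-trip to PyPI is needed because both package names and version strings may themselves contain hyphens, so blindly splitting on '-' is ambiguous. A sketch of the same disambiguation done locally against a set of known names (`split_name_version` is a hypothetical helper, not part of the notebook):

```python
def split_name_version(pkg_str, known_names):
    # Try each '-' split point from the right and accept the first
    # whose left-hand side is a known package name, so the longest
    # matching name wins (e.g. 'astropy-helpers' over 'astropy').
    parts = pkg_str.split('-')
    for i in range(len(parts) - 1, 0, -1):
        name = '-'.join(parts[:i])
        if name in known_names:
            return name, '-'.join(parts[i:])
    return pkg_str, None

known = {'astropy', 'astropy-helpers'}
print(split_name_version('astropy-helpers-1.1', known))  # ('astropy-helpers', '1.1')
print(split_name_version('astropy-1.1.2', known))        # ('astropy', '1.1.2')
```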

In [ ]:
df['package_name'] = np.nan
df['package_version'] = np.nan
for i, package in enumerate(packages):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    for release in client.package_releases(package):
        pkg_str = '{}-{}'.format(package, release)
        idx = df.loc[df.package == pkg_str].index
        if len(idx) > 0:
            df.loc[idx, 'package_name'] = package
            df.loc[idx, 'package_version'] = release
df.head()
In [ ]:
# Save to file
df.to_csv('requirements.csv', index=False)

Base dependencies

I have now parsed the formal dependencies for 20642 Python packages. However, some of those dependencies have dependencies of their own. Let's go ahead and find the base dependencies: I will look up the requirements that each requirement itself has, and keep going until no new dependencies turn up.

Difficulties:

  1. Cyclic dependencies: astropy requires wcs_axes, which itself requires astropy, so a naive recursive solution never terminates. I use a Tree class that tracks what has already been searched to avoid infinite loops.
In [ ]:
class Tree(object):
    def __init__(self, name):
        self.name = name
        self.children = []
        return

    def __contains__(self, obj):
        return obj == self.name or any([obj in c for c in self.children])
    
    def add(self, obj):
        if not self.__contains__(obj):
            self.children.append(Tree(obj))
            return True
        return False
    
    def get_base_requirements(self):
        base = []
        for child in self.children:
            if len(child.children) == 0:
                base.append(child.name)
            else:
                # Recurse into the child itself, so that leaf grandchildren
                # are collected too (note: children is a list, not a method).
                base.extend(child.get_base_requirements())
        return np.unique(base)
    

def get_requirements(package):
    return df.loc[(df.package_name == package) & (df.requirement.notnull()), 'requirement'].values


def get_dependency_tree(package, tree):
    reqs = get_requirements(package)
    for req in reqs:
        #print(req)
        flg = tree.add(req)
        if not flg:
            continue
        tree = get_dependency_tree(req, tree)  # recurse into the new requirement
    return tree

    
In [ ]:
p = 'astroquery'  # try also e.g. '115wangpan'
get_dependency_tree(p, Tree(p)).get_base_requirements()
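To see the cycle guard in action without touching the full dataframe, here is a condensed, self-contained replay of the traversal on a toy dependency graph containing a cycle (a requires b, b requires a and c); a plain dict replaces the dataframe lookup:

```python
class Tree(object):
    def __init__(self, name):
        self.name = name
        self.children = []

    def __contains__(self, obj):
        return obj == self.name or any(obj in c for c in self.children)

    def add(self, obj):
        if obj not in self:
            self.children.append(Tree(obj))
            return True
        return False

# Toy dependency graph with a cycle: a -> b -> a.
REQS = {'a': ['b'], 'b': ['a', 'c'], 'c': []}

def get_dependency_tree(package, tree):
    for req in REQS.get(package, []):
        if tree.add(req):           # add() returns False for an already-seen package
            get_dependency_tree(req, tree)
    return tree

tree = get_dependency_tree('a', Tree('a'))
print(sorted(c.name for c in tree.children))  # ['b', 'c']
```

When the traversal reaches b and sees a again, `add` returns False and the recursion stops, so the call terminates despite the cycle.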
In [ ]:
datadict = defaultdict(list)
for i, package in enumerate(df.package_name.unique()):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    try:
        deptree = get_dependency_tree(package, Tree(package))
    except Exception:
        print('Failure getting base dependencies for {}'.format(package))
        raise  # re-raise the original error instead of masking it
    for dependency in deptree.get_base_requirements():
        datadict['package_name'].append(package)
        datadict['requirements'].append(dependency)

base_df = pd.DataFrame(data=datadict)
base_df.head()
In [ ]:
base_df.to_csv('base_requirements.csv', index=False)