The Smack Dataset

The Smack Dataset does not exist. In the future, if it arises, it will be a libre build of The Stack dataset without using the original dataset directly due to non-libre (non-“open source”) license encumbrances.

The Stack Metadata

The Stack has a separate metadata repository containing information about the dataset without hosting the dataset itself. This practice is beneficial as it allows researchers to understand dataset contents without being bound by licenses. For instance, how can one agree to a license when they’re unaware of the content’s licenses? By using metadata files, this issue can be mitigated.

Link to the Git Repository:

git clone https://huggingface.co/datasets/bigcode/the-stack-metadata

Downloading Metadata

The metadata is considerably smaller than the full dataset, but it is still large: the Git repository is approximately one terabyte in size.

Reading Metadata

The Stack’s metadata is stored in Parquet format, a welcome choice. The Parquet files total 562 gigabytes, spread over 2,832 individual files in 945 directories.
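To check figures like these against a local clone, a short inventory script can help. The following is a minimal sketch using only the Python standard library; the function name summarize_parquet_tree is hypothetical, and the sketch assumes the Parquet files carry a .parquet extension somewhere below the directory passed in.

```python
from pathlib import Path


def summarize_parquet_tree(root):
    """Count .parquet files under root, the directories holding them, and their total size."""
    # rglob walks the tree recursively; is_file() skips directories named *.parquet.
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    directories = {p.parent for p in files}
    total_bytes = sum(p.stat().st_size for p in files)
    return {
        "files": len(files),
        "directories": len(directories),
        "bytes": total_bytes,
    }
```

Calling summarize_parquet_tree on the cloned repository's data directory returns a dictionary with the three counts, which can then be compared with the numbers above.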

Selecting Repos

Write a script to filter appropriate repositories based on libre criteria.
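One possible shape for such a filter is sketched below. It is only an illustration: the allow-list of SPDX identifiers is an assumed example rather than the project's actual libre criteria, and the field names repo_name and licenses are hypothetical stand-ins for whatever columns the metadata really provides.

```python
# Assumed example allow-list; the real libre criteria are not defined here.
LIBRE_LICENSES = frozenset({"Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later", "MIT"})


def is_libre(licenses, allowed=LIBRE_LICENSES):
    """Return True when at least one license is declared and all of them are allowed."""
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Yield names of repositories whose every detected license is in the allow-list."""
    for record in records:
        # "repo_name" and "licenses" are hypothetical field names.
        if is_libre(record.get("licenses"), allowed):
            yield record["repo_name"]
```

Requiring every detected license to be on the allow-list, and rejecting repositories with no detected license, is the conservative choice: a repository is kept only when nothing about it is unknown or non-libre.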

Cloning Repos

Write a script to clone the selected repositories.
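A cloning script could look roughly like the sketch below. It assumes repositories are identified as owner/name and hosted on GitHub, and that a shallow clone is sufficient; the function names and the on-disk naming scheme are hypothetical. By default it runs in dry-run mode and only returns the commands it would execute.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root, host="https://github.com"):
    """Build a shallow `git clone` command for an owner/name repository."""
    # Flatten owner/name into one directory name (an assumed naming scheme).
    target = Path(dest_root) / repo_name.replace("/", "__")
    return ["git", "clone", "--depth", "1", f"{host}/{repo_name}.git", str(target)]


def clone_repos(repo_names, dest_root, dry_run=True):
    """Clone each repository, skipping targets that already exist on disk."""
    commands = []
    for name in repo_names:
        cmd = clone_command(name, dest_root)
        if Path(cmd[-1]).exists():
            continue  # already cloned in an earlier run
        commands.append(cmd)
        if not dry_run:
            # check=False: a deleted or private repository should not stop the run.
            subprocess.run(cmd, check=False)
    return commands
```

Skipping existing targets makes the script safe to re-run after an interruption, which matters when the list of selected repositories is long.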

Train

Use libre code from BigCode (creators of The Stack) for model training.

Scripts

The following scripts are available:

  • the-stack-headers - Retrieves header names from The Stack’s parquet files.

  • the-stack-licenses - Extracts licenses and records from The Stack’s license file.

Code Assist

The following scripts were developed using Parrot code assist:

  • the-stack-headers

  • the-stack-licenses

These scripts were created with the Phind-CodeLlama-34B-v2_q8.gguf model from TheBloke.

The the-stack-licenses script reads and prints specific records from the lic.parquet file in a numbered directory under the data/ subdirectory.

Example usage: python3 script.py --records 1-5 -c

Command-line options:

  • -h, --help - Show this help message and exit.

  • --version - Show program’s version number and exit.

  • -r RANGE, --records=RANGE - Record number or range to print (e.g., 1, 5-7).

  • -c, --color - Colorize the output.

  • -l, --list_licenses - List unique licenses in the file.
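The RANGE value accepted by --records is either a single record number or a start-end pair. The sketch below shows one way such a value might be parsed and applied. It is not the script's actual implementation: the helper names are hypothetical, and it assumes record numbers are 1-based and ranges are inclusive.

```python
def parse_record_range(spec):
    """Parse "N" or "N-M" into an inclusive (start, end) tuple of 1-based numbers."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)  # raises ValueError for empty or non-numeric input
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid record range: {spec!r}")
    return start, end


def slice_records(rows, spec):
    """Return the rows selected by a range string, using 1-based inclusive bounds."""
    start, end = parse_record_range(spec)
    return rows[start - 1:end]
```

With these assumptions, "5-7" selects the fifth through seventh records and "1" selects only the first.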

the_smack.the_stack_licenses.get_records(dataframe, args)

Extract records from a DataFrame based on user-specified range.

Parameters:

dataframe (DataFrame): The pandas DataFrame to extract records from.

args (Namespace): A namespace object containing parsed command line arguments.

Returns:

DataFrame: The extracted records as a new DataFrame.

the_smack.the_stack_licenses.main()

Main function to parse command line arguments and run the script.

the_smack.the_stack_licenses.print_records(dataframe, color)

Print the records in a DataFrame with optional colorization.

Parameters:

dataframe (DataFrame): The pandas DataFrame to print.

color (bool): If True, colorize the output.

the_smack.the_stack_licenses.print_unique_licenses(dataframe)

Print the unique licenses in a DataFrame, sorted alphabetically.

Parameters:

dataframe (DataFrame): The pandas DataFrame to extract licenses from.
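The idea of listing unique licenses in alphabetical order can be illustrated without pandas. The sketch below works on plain Python lists rather than a DataFrame, so it is not the function documented above; the name unique_licenses is hypothetical, and it assumes each record carries a list of license identifiers, or nothing at all.

```python
def unique_licenses(license_lists):
    """Return the distinct license names, sorted alphabetically ignoring case."""
    # "licenses or ()" tolerates records whose license list is None or empty.
    found = {lic for licenses in license_lists for lic in licenses or ()}
    return sorted(found, key=str.casefold)
```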