Package names

One of the motivations for this package was to investigate using visual distances to look out for issues similar to typosquatting in the Julia General package registry.

The problem of interest is the following: say a user is following a Julia tutorial online, but a malicious person has substituted a popular package name for a similiar-looking one in the tutorial. When the unsuspecting user copy-pastes the commands to install the package, they don't realize they are installing the malicious one. To prevent this kind of abuse, it could be useful to add an automated check to the registry process to check that new package registrations' names aren't very close visually to existing packages, and to perhaps issue a warning when they are.

visual_distance provides a means of evaluating how close two strings look. Let's investigate it in the context of package names.

Let us consider some visually-confusable names, and compute their visual distances, as well as a simple edit distance (the Damerau-Levenshtein distance).

julia> using VisualStringDistances, DataFrames, StringDistances
julia> const DL = DamerauLevenshtein();
julia> # Define our distance measure
       d(s1, s2) = visual_distance(s1, s2; normalize=x -> 5 + sqrt(x))d (generic function with 1 method)
julia> d((s1,s2)) = d(s1, s2)d (generic function with 2 methods)
julia> df_subs = DataFrame([
           ("jellyfish", "jeIlyfish"), # https://developer-tech.com/news/2019/dec/05/python-libraries-dateutil-jellyfish-stealing-ssh-gpg-keys/
           ("DifferentialEquations", "DifferentIalEquations"),
           ("ANOVA", "AN0VA"),
           ("ODEInterfaceDiffEq", "0DEInterfaceDiffEq"),
           ("ValueOrientedRiskManagementInsurance", "ValueOrientedRiskManagementlnsurance"),
           ("IsoPkg", "lsoPkg"),
           ("DiffEqNoiseProcess", "DiffEgNoiseProcess"),
           ("Graph500", "Graph5O0")
       ]);
julia> rename!(df_subs, [:name1, :name2]);
julia> df_subs.DL = DL.(df_subs.name1, df_subs.name2);
julia> df_subs.sqrt_normalized_DL = df_subs.DL ./ ( 5 .+ sqrt.(max.(length.(df_subs.name1), length.(df_subs.name2))) );
julia> df_subs.sqrt_normalized_visual_dist = d.(df_subs.name1, df_subs.name2);
julia> sort!(df_subs, :sqrt_normalized_visual_dist);

df_subs

8 rows × 5 columns

	name1	name2	DL	sqrt_normalized_DL	sqrt_normalized_visual_dist
	String	String	Int64	Float64	Float64
1	ValueOrientedRiskManagementInsurance	ValueOrientedRiskManagementlnsurance	1	0.0909091	0.295402
2	jellyfish	jeIlyfish	1	0.125	0.405719
3	IsoPkg	lsoPkg	1	0.134237	0.435678
4	DifferentialEquations	DifferentIalEquations	1	0.104356	0.441529
5	ODEInterfaceDiffEq	0DEInterfaceDiffEq	1	0.108194	0.871758
6	Graph500	Graph5O0	1	0.12774	1.02915
7	ANOVA	AN0VA	1	0.138197	1.11333
8	DiffEqNoiseProcess	DiffEgNoiseProcess	1	0.108194	1.58004

We can see all the pairs have DL distance of 1, since they are 1 edit apart. Their normalized DL-distances thus just depend on their length. However, they have various visual distances, depending on what subsitution was made. Note that GNU Unifont renders zeros with a slash through the middle, and hence VisualStringDistances.jl sees "O" and "0" as fairly different.

Let us compare to some real package names from the registry. We will in fact consider all package names, but then filter them down to a manageable list via the edit distance.

julia> using Pkg
julia> function get_all_package_names(registry_dir::AbstractString)
           packages = [x["name"] for x in values(Pkg.TOML.parsefile(joinpath(registry_dir, "Registry.toml"))["packages"])]
           sort!(packages)
           unique!(packages)
           return packages
       endget_all_package_names (generic function with 1 method)
julia> names = get_all_package_names(expanduser("~/.julia/registries/General"));
julia> filter!(x -> !endswith(x, "_jll"), names);
julia> @info "Loaded list of non-JLL package names ($(length(names)) names)"[ Info: Loaded list of non-JLL package names (5426 names)
julia> normalized_dl_cutoff = .2;
julia> dl_cutoff = 1;
julia> @info "Computing list of pairs of package names within $(dl_cutoff) in DL distance or $(normalized_dl_cutoff) in normalized DL distance..."[ Info: Computing list of pairs of package names within 1 in DL distance or 0.2 in normalized DL distance...
julia> @time df = DataFrame(collect( (name1=names[i],name2=names[j]) for i = 1:length(names) for j = 1:(i-1) if (normalize(DL)(names[i], names[j]) <= normalized_dl_cutoff) || DL(names[i], names[j]) <= dl_cutoff)); 39.524589 seconds (187.02 M allocations: 14.111 GiB, 5.22% gc time)
julia> @info "Found $(size(df,1)) pairs of packages meeting the criteria.";[ Info: Found 227 pairs of packages meeting the criteria.
julia> df.DL = DL.(df.name1, df.name2);
julia> df.sqrt_normalized_DL = df.DL ./ ( 5 .+ sqrt.(max.(length.(df.name1), length.(df.name2))) );
julia> @time df.sqrt_normalized_visual_dist = d.(df.name1, df.name2);126.439154 seconds (8.89 k allocations: 398.760 MiB, 0.02% gc time)

Let's look at the 5 closest pairs according to the normalized visual distance.

sort!(df, :sqrt_normalized_visual_dist);
df[1:5, :]

5 rows × 5 columns

	name1	name2	DL	sqrt_normalized_DL	sqrt_normalized_visual_dist
	String	String	Int64	Float64	Float64
1	BLASBenchmarksGPU	BLASBenchmarksCPU	1	0.109612	0.516984
2	BigO	BigG	1	0.142857	0.530978
3	Modia	Media	1	0.138197	0.550515
4	MRIReco	MPIReco	1	0.130792	0.594456
5	UpROOT	UnROOT	1	0.134237	0.628649

Here, we see that by this measurement, the closest pair of packages is "Modia" and "Media". Indeed they look fairly similar, although they are not as easy to mistake for each other as many of the earlier examples.

Let's compare to the 5 closest pairs according to the normalized edit distance.

sort!(df, :sqrt_normalized_DL);
df[1:5, :]

5 rows × 5 columns

	name1	name2	DL	sqrt_normalized_DL	sqrt_normalized_visual_dist
	String	String	Int64	Float64	Float64
1	ModiaPlot_WGLMakie	ModiaPlot_GLMakie	1	0.108194	13.4284
2	BLASBenchmarksGPU	BLASBenchmarksCPU	1	0.109612	0.516984
3	StanMCMCChains	StanMCMCChain	1	0.114395	2.12115
4	ForwardDiff2	ForwardDiff	1	0.118146	2.59408
5	ThreadTools	ThreadPools	1	0.120241	2.08876

These are just the longest package names that are 1 edit away from each other.