Package names

One of the motivations for this package was to investigate using visual distances to look out for issues similar to typosquatting in the Julia General package registry.

The problem of interest is the following: say a user is following a Julia tutorial online, but a malicious person has substituted a popular package name for a similiar-looking one in the tutorial. When the unsuspecting user copy-pastes the commands to install the package, they don't realize they are installing the malicious one. To prevent this kind of abuse, it could be useful to add an automated check to the registry process to check that new package registrations' names aren't very close visually to existing packages, and to perhaps issue a warning when they are.

visual_distance provides a means of evaluating how close two strings look. Let's investigate it in the context of package names.

Let us consider some visually-confusable names, and compute their visual distances, as well as a simple edit distance (the Damerau-Levenshtein distance).

julia> using VisualStringDistances, DataFrames, StringDistances
julia> const DL = DamerauLevenshtein();
julia> # Define our distance measure d(s1, s2) = visual_distance(s1, s2; normalize=x -> 5 + sqrt(x))d (generic function with 1 method)
julia> d((s1,s2)) = d(s1, s2)d (generic function with 2 methods)
julia> df_subs = DataFrame([ ("jellyfish", "jeIlyfish"), # https://developer-tech.com/news/2019/dec/05/python-libraries-dateutil-jellyfish-stealing-ssh-gpg-keys/ ("DifferentialEquations", "DifferentIalEquations"), ("ANOVA", "AN0VA"), ("ODEInterfaceDiffEq", "0DEInterfaceDiffEq"), ("ValueOrientedRiskManagementInsurance", "ValueOrientedRiskManagementlnsurance"), ("IsoPkg", "lsoPkg"), ("DiffEqNoiseProcess", "DiffEgNoiseProcess"), ("Graph500", "Graph5O0") ]);
julia> rename!(df_subs, [:name1, :name2]);
julia> df_subs.DL = DL.(df_subs.name1, df_subs.name2);
julia> df_subs.sqrt_normalized_DL = df_subs.DL ./ ( 5 .+ sqrt.(max.(length.(df_subs.name1), length.(df_subs.name2))) );
julia> df_subs.sqrt_normalized_visual_dist = d.(df_subs.name1, df_subs.name2);
julia> sort!(df_subs, :sqrt_normalized_visual_dist);
df_subs

8 rows × 5 columns

name1name2DLsqrt_normalized_DLsqrt_normalized_visual_dist
StringStringInt64Float64Float64
1ValueOrientedRiskManagementInsuranceValueOrientedRiskManagementlnsurance10.09090910.295402
2jellyfishjeIlyfish10.1250.405719
3IsoPkglsoPkg10.1342370.435678
4DifferentialEquationsDifferentIalEquations10.1043560.441529
5ODEInterfaceDiffEq0DEInterfaceDiffEq10.1081940.871758
6Graph500Graph5O010.127741.02915
7ANOVAAN0VA10.1381971.11333
8DiffEqNoiseProcessDiffEgNoiseProcess10.1081941.58004

We can see all the pairs have DL distance of 1, since they are 1 edit apart. Their normalized DL-distances thus just depend on their length. However, they have various visual distances, depending on what subsitution was made. Note that GNU Unifont renders zeros with a slash through the middle, and hence VisualStringDistances.jl sees "O" and "0" as fairly different.

Let us compare to some real package names from the registry. We will in fact consider all package names, but then filter them down to a manageable list via the edit distance.

julia> using Pkg
julia> function get_all_package_names(registry_dir::AbstractString) packages = [x["name"] for x in values(Pkg.TOML.parsefile(joinpath(registry_dir, "Registry.toml"))["packages"])] sort!(packages) unique!(packages) return packages endget_all_package_names (generic function with 1 method)
julia> names = get_all_package_names(expanduser("~/.julia/registries/General"));
julia> filter!(x -> !endswith(x, "_jll"), names);
julia> @info "Loaded list of non-JLL package names ($(length(names)) names)"[ Info: Loaded list of non-JLL package names (5426 names)
julia> normalized_dl_cutoff = .2;
julia> dl_cutoff = 1;
julia> @info "Computing list of pairs of package names within $(dl_cutoff) in DL distance or $(normalized_dl_cutoff) in normalized DL distance..."[ Info: Computing list of pairs of package names within 1 in DL distance or 0.2 in normalized DL distance...
julia> @time df = DataFrame(collect( (name1=names[i],name2=names[j]) for i = 1:length(names) for j = 1:(i-1) if (normalize(DL)(names[i], names[j]) <= normalized_dl_cutoff) || DL(names[i], names[j]) <= dl_cutoff)); 39.524589 seconds (187.02 M allocations: 14.111 GiB, 5.22% gc time)
julia> @info "Found $(size(df,1)) pairs of packages meeting the criteria.";[ Info: Found 227 pairs of packages meeting the criteria.
julia> df.DL = DL.(df.name1, df.name2);
julia> df.sqrt_normalized_DL = df.DL ./ ( 5 .+ sqrt.(max.(length.(df.name1), length.(df.name2))) );
julia> @time df.sqrt_normalized_visual_dist = d.(df.name1, df.name2);126.439154 seconds (8.89 k allocations: 398.760 MiB, 0.02% gc time)

Let's look at the 5 closest pairs according to the normalized visual distance.

sort!(df, :sqrt_normalized_visual_dist);
df[1:5, :]

5 rows × 5 columns

name1name2DLsqrt_normalized_DLsqrt_normalized_visual_dist
StringStringInt64Float64Float64
1BLASBenchmarksGPUBLASBenchmarksCPU10.1096120.516984
2BigOBigG10.1428570.530978
3ModiaMedia10.1381970.550515
4MRIRecoMPIReco10.1307920.594456
5UpROOTUnROOT10.1342370.628649

Here, we see that by this measurement, the closest pair of packages is "Modia" and "Media". Indeed they look fairly similar, although they are not as easy to mistake for each other as many of the earlier examples.

Let's compare to the 5 closest pairs according to the normalized edit distance.

sort!(df, :sqrt_normalized_DL);
df[1:5, :]

5 rows × 5 columns

name1name2DLsqrt_normalized_DLsqrt_normalized_visual_dist
StringStringInt64Float64Float64
1ModiaPlot_WGLMakieModiaPlot_GLMakie10.10819413.4284
2BLASBenchmarksGPUBLASBenchmarksCPU10.1096120.516984
3StanMCMCChainsStanMCMCChain10.1143952.12115
4ForwardDiff2ForwardDiff10.1181462.59408
5ThreadToolsThreadPools10.1202412.08876

These are just the longest package names that are 1 edit away from each other.