Package names
One of the motivations for this package was to investigate using visual distances to look out for issues similar to typosquatting in the Julia General package registry.
The problem of interest is the following: say a user is following a Julia tutorial online, but a malicious person has substituted a popular package name for a similiar-looking one in the tutorial. When the unsuspecting user copy-pastes the commands to install the package, they don't realize they are installing the malicious one. To prevent this kind of abuse, it could be useful to add an automated check to the registry process to check that new package registrations' names aren't very close visually to existing packages, and to perhaps issue a warning when they are.
visual_distance
provides a means of evaluating how close two strings look. Let's investigate it in the context of package names.
Let us consider some visually-confusable names, and compute their visual distances, as well as a simple edit distance (the Damerau-Levenshtein distance).
julia> using VisualStringDistances, DataFrames, StringDistances
julia> const DL = DamerauLevenshtein();
julia> # Define our distance measure d(s1, s2) = visual_distance(s1, s2; normalize=x -> 5 + sqrt(x))
d (generic function with 1 method)
julia> d((s1,s2)) = d(s1, s2)
d (generic function with 2 methods)
julia> df_subs = DataFrame([ ("jellyfish", "jeIlyfish"), # https://developer-tech.com/news/2019/dec/05/python-libraries-dateutil-jellyfish-stealing-ssh-gpg-keys/ ("DifferentialEquations", "DifferentIalEquations"), ("ANOVA", "AN0VA"), ("ODEInterfaceDiffEq", "0DEInterfaceDiffEq"), ("ValueOrientedRiskManagementInsurance", "ValueOrientedRiskManagementlnsurance"), ("IsoPkg", "lsoPkg"), ("DiffEqNoiseProcess", "DiffEgNoiseProcess"), ("Graph500", "Graph5O0") ]);
julia> rename!(df_subs, [:name1, :name2]);
julia> df_subs.DL = DL.(df_subs.name1, df_subs.name2);
julia> df_subs.sqrt_normalized_DL = df_subs.DL ./ ( 5 .+ sqrt.(max.(length.(df_subs.name1), length.(df_subs.name2))) );
julia> df_subs.sqrt_normalized_visual_dist = d.(df_subs.name1, df_subs.name2);
julia> sort!(df_subs, :sqrt_normalized_visual_dist);
df_subs
8 rows × 5 columns
name1 | name2 | DL | sqrt_normalized_DL | sqrt_normalized_visual_dist | |
---|---|---|---|---|---|
String | String | Int64 | Float64 | Float64 | |
1 | ValueOrientedRiskManagementInsurance | ValueOrientedRiskManagementlnsurance | 1 | 0.0909091 | 0.295402 |
2 | jellyfish | jeIlyfish | 1 | 0.125 | 0.405719 |
3 | IsoPkg | lsoPkg | 1 | 0.134237 | 0.435678 |
4 | DifferentialEquations | DifferentIalEquations | 1 | 0.104356 | 0.441529 |
5 | ODEInterfaceDiffEq | 0DEInterfaceDiffEq | 1 | 0.108194 | 0.871758 |
6 | Graph500 | Graph5O0 | 1 | 0.12774 | 1.02915 |
7 | ANOVA | AN0VA | 1 | 0.138197 | 1.11333 |
8 | DiffEqNoiseProcess | DiffEgNoiseProcess | 1 | 0.108194 | 1.58004 |
We can see all the pairs have DL distance of 1, since they are 1 edit apart. Their normalized DL-distances thus just depend on their length. However, they have various visual distances, depending on what subsitution was made. Note that GNU Unifont renders zeros with a slash through the middle, and hence VisualStringDistances.jl sees "O" and "0" as fairly different.
Let us compare to some real package names from the registry. We will in fact consider all package names, but then filter them down to a manageable list via the edit distance.
julia> using Pkg
julia> function get_all_package_names(registry_dir::AbstractString) packages = [x["name"] for x in values(Pkg.TOML.parsefile(joinpath(registry_dir, "Registry.toml"))["packages"])] sort!(packages) unique!(packages) return packages end
get_all_package_names (generic function with 1 method)
julia> names = get_all_package_names(expanduser("~/.julia/registries/General"));
julia> filter!(x -> !endswith(x, "_jll"), names);
julia> @info "Loaded list of non-JLL package names ($(length(names)) names)"
[ Info: Loaded list of non-JLL package names (5426 names)
julia> normalized_dl_cutoff = .2;
julia> dl_cutoff = 1;
julia> @info "Computing list of pairs of package names within $(dl_cutoff) in DL distance or $(normalized_dl_cutoff) in normalized DL distance..."
[ Info: Computing list of pairs of package names within 1 in DL distance or 0.2 in normalized DL distance...
julia> @time df = DataFrame(collect( (name1=names[i],name2=names[j]) for i = 1:length(names) for j = 1:(i-1) if (normalize(DL)(names[i], names[j]) <= normalized_dl_cutoff) || DL(names[i], names[j]) <= dl_cutoff));
39.524589 seconds (187.02 M allocations: 14.111 GiB, 5.22% gc time)
julia> @info "Found $(size(df,1)) pairs of packages meeting the criteria.";
[ Info: Found 227 pairs of packages meeting the criteria.
julia> df.DL = DL.(df.name1, df.name2);
julia> df.sqrt_normalized_DL = df.DL ./ ( 5 .+ sqrt.(max.(length.(df.name1), length.(df.name2))) );
julia> @time df.sqrt_normalized_visual_dist = d.(df.name1, df.name2);
126.439154 seconds (8.89 k allocations: 398.760 MiB, 0.02% gc time)
Let's look at the 5 closest pairs according to the normalized visual distance.
sort!(df, :sqrt_normalized_visual_dist);
df[1:5, :]
5 rows × 5 columns
name1 | name2 | DL | sqrt_normalized_DL | sqrt_normalized_visual_dist | |
---|---|---|---|---|---|
String | String | Int64 | Float64 | Float64 | |
1 | BLASBenchmarksGPU | BLASBenchmarksCPU | 1 | 0.109612 | 0.516984 |
2 | BigO | BigG | 1 | 0.142857 | 0.530978 |
3 | Modia | Media | 1 | 0.138197 | 0.550515 |
4 | MRIReco | MPIReco | 1 | 0.130792 | 0.594456 |
5 | UpROOT | UnROOT | 1 | 0.134237 | 0.628649 |
Here, we see that by this measurement, the closest pair of packages is "Modia" and "Media". Indeed they look fairly similar, although they are not as easy to mistake for each other as many of the earlier examples.
Let's compare to the 5 closest pairs according to the normalized edit distance.
sort!(df, :sqrt_normalized_DL);
df[1:5, :]
5 rows × 5 columns
name1 | name2 | DL | sqrt_normalized_DL | sqrt_normalized_visual_dist | |
---|---|---|---|---|---|
String | String | Int64 | Float64 | Float64 | |
1 | ModiaPlot_WGLMakie | ModiaPlot_GLMakie | 1 | 0.108194 | 13.4284 |
2 | BLASBenchmarksGPU | BLASBenchmarksCPU | 1 | 0.109612 | 0.516984 |
3 | StanMCMCChains | StanMCMCChain | 1 | 0.114395 | 2.12115 |
4 | ForwardDiff2 | ForwardDiff | 1 | 0.118146 | 2.59408 |
5 | ThreadTools | ThreadPools | 1 | 0.120241 | 2.08876 |
These are just the longest package names that are 1 edit away from each other.