Search | arXiv e-print repository

SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

Authors: Soufiane Belharbi, Mara KM Whitford, Phuong Hoang, Shakeeb Murtaza, Luke McCaffrey, Eric Granger

Abstract: Confocal fluorescence microscopy is one of the most accessible and widely used imaging techniques for the study of biological processes. Scanning confocal microscopy allows the capture of high-quality images from 3D samples, yet suffers from well-known limitations such as photobleaching and phototoxicity of specimens caused by intense light exposure, which limits its use in some applications, especially for living cells. Cellular damage can be alleviated by changing imaging parameters to reduce light exposure, often at the expense of image quality. Machine/deep learning methods for single-image super-resolution (SISR) can be applied to restore image quality by upscaling lower-resolution (LR) images to produce high-resolution images (HR). These SISR methods have been successfully applied to photo-realistic images due partly to the abundance of publicly available data. In contrast, the lack of publicly available data partly limits their application and success in scanning confocal microscopy. In this paper, we introduce a large scanning confocal microscopy dataset named SR-CACO-2 that is comprised of low- and high-resolution image pairs marked for three different fluorescent markers. It allows the evaluation of performance of SISR methods on three different upscaling levels (X2, X4, X8). SR-CACO-2 contains the human epithelial cell line Caco-2 (ATCC HTB-37), and it is composed of 22 tiles that have been translated in the form of 9,937 image patches for experiments with SISR methods.
Given the new SR-CACO-2 dataset, we also provide benchmarking results for 15 state-of-the-art methods that are representative of the main SISR families. Results show that these methods have limited success in producing high-resolution textures, indicating that SR-CACO-2 represents a challenging problem. Our dataset, code and pretrained weights are available: https://github.com/sbelharbi/sr-caco-2.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 23 pages, 13 figures

arXiv:2404.16741

Parameterized Complexity of Efficient Sortation

Authors: Robert Ganian, Hung P. Hoang, Simon Wietheger

Abstract: A crucial challenge arising in the design of large-scale logistical networks is to optimize parcel sortation for routing. We study this problem under the recent graph-theoretic formalization of Van Dyk, Klause, Koenemann and Megow (IPCO 2024). The problem asks - given an input digraph D (the fulfillment network) together with a set of commodities represented as source-sink tuples - for a minimum-outdegree subgraph H of the transitive closure of D that contains a source-sink route for each of the commodities. Given the underlying motivation, we study two variants of the problem which differ in whether the routes for the commodities are assumed to be given, or can be chosen arbitrarily. We perform a thorough parameterized analysis of the complexity of both problems. Our results concentrate on three fundamental parameterizations of the problem: (1) When attempting to parameterize by the target outdegree of H, we show that the problems are paraNP-hard even in highly restricted cases; (2) When parameterizing by the number of commodities, we utilize Ramsey-type arguments, kernelization and treewidth reduction techniques to obtain parameterized algorithms for both problems; (3) When parameterizing by the structure of D, we establish fixed-parameter tractability for both problems w.r.t. treewidth, maximum degree and the maximum routing length. We combine this with lower bounds which show that omitting any of the three parameters results in paraNP-hardness.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2403.15882

VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding

Authors: Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Abstract: The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. To provide an insightful overview of the current state of Vietnamese NLU, we then evaluate seven state-of-the-art pre-trained models, including both multilingual and Vietnamese monolingual models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pretraining step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. CafeBERT is made publicly available for research purposes.

Submitted 23 March, 2024; originally announced March 2024.

Comments: Accepted at NAACL 2024 (Findings)

arXiv:2402.07061

The $k$-Opt algorithm for the Traveling Salesman Problem has exponential running time for $k \ge 5$

Authors: Sophia Heimann, Hung P. Hoang, Stefan Hougardy

Abstract: The $k$-Opt algorithm is a local search algorithm for the Traveling Salesman Problem. Starting with an initial tour, it iteratively replaces at most $k$ edges in the tour with the same number of edges to obtain a better tour. Krentel (FOCS 1989) showed that the Traveling Salesman Problem with the $k$-Opt neighborhood is complete for the class PLS (polynomial time local search) and that the $k$-Opt algorithm can have exponential running time for any pivot rule. However, his proof requires $k \gg 1000$ and has a substantial gap. We show the two properties above for a much smaller value of $k$, addressing an open question by Monien, Dumrauf, and Tscheuschner (ICALP 2010). In particular, we prove the PLS-completeness for $k \geq 17$ and the exponential running time for $k \geq 5$.

Submitted 13 June, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

Comments: Appeared in ICALP 2024

MSC Class: 68W25; 68W40; 68Q25; 90C27 ACM Class: F.2.2; G.2.1; G.2.2
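
As a rough illustration of the local search described in the abstract above, the following sketch implements only the smallest case, $k = 2$ (2-Opt), on Euclidean points. The function names, the first-improvement pivot rule, and the Euclidean distance are assumptions made for this example; they are not taken from the paper, whose results concern $k \geq 5$.

```python
import math


def tour_length(points, tour):
    """Total Euclidean length of the closed tour."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def two_opt(points, tour):
    """Apply improving 2-edge exchanges until none remains (k-Opt with k = 2).

    Pivot rule (an assumption of this sketch): take the first improving move found.
    """
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a vertex, nothing to exchange
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a, b) and (c, d) with (a, c) and (b, d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour


# A self-crossing tour on the corners of the unit square.
square = [(0, 0), (1, 1), (1, 0), (0, 1)]
local_optimum = two_opt(square, [0, 1, 2, 3])
```

On this tiny instance a single exchange removes the crossing. The abstract's point is that, for larger $k$, instances exist on which such a search needs exponentially many improving steps.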

arXiv:2401.10044

Deep spatial context: when attention-based models meet spatial regression

Authors: Paulina Tomaszewska, Elżbieta Sienkiewicz, Mai P. Hoang, Przemysław Biecek

Abstract: We propose the 'Deep spatial context' (DSCon) method for investigating attention-based vision models using the concept of spatial context. It was inspired by histopathologists; however, the method can be applied to various domains. The DSCon allows for a quantitative measure of the spatial context's role using three Spatial Context Measures: $SCM_{features}$, $SCM_{targets}$, $SCM_{residuals}$ to distinguish whether the spatial context is observable within the features of neighboring regions, their target values (attention scores) or residuals, respectively. It is achieved by integrating spatial regression into the pipeline. The DSCon helps to verify research questions. The experiments reveal that spatial relationships are much bigger in the case of the classification of tumor lesions than normal tissues. Moreover, it turns out that the larger the size of the neighborhood taken into account within spatial regression, the less valuable contextual information is. Furthermore, it is observed that the spatial context measure is the largest when considered within the feature space as opposed to the targets and residuals.

Submitted 10 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2311.15297

Controllable Expensive Multi-objective Learning with Warm-starting Bayesian Optimization

Authors: Quang-Huy Nguyen, Long P. Hoang, Hoang V. Viet, Dung D. Le

Abstract: Pareto Set Learning (PSL) is a promising approach for approximating the entire Pareto front in multi-objective optimization (MOO) problems. However, existing derivative-free PSL methods are often unstable and inefficient, especially for expensive black-box MOO problems where objective function evaluations are costly. In this work, we propose to address the instability and inefficiency of existing PSL methods with a novel controllable PSL method, called Co-PSL. Particularly, Co-PSL consists of two stages: (1) warm-starting Bayesian optimization to obtain quality Gaussian process priors and (2) controllable Pareto set learning to accurately acquire a parametric mapping from preferences to the corresponding Pareto solutions. The former is to help stabilize the PSL process and reduce the number of expensive function evaluations. The latter is to support real-time trade-off control between conflicting objectives. Performances across synthetic and real-world MOO problems showcase the effectiveness of our Co-PSL for expensive multi-objective optimization tasks.

Submitted 9 February, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

arXiv:2304.04835

doi 10.1145/3543507.3583189

Measuring and Evading Turkmenistan's Internet Censorship: A Case Study in Large-Scale Measurements of a Low-Penetration Country

Authors: Sadia Nourin, Van Tran, Xi Jiang, Kevin Bock, Nick Feamster, Nguyen Phong Hoang, Dave Levin

Abstract: Since 2006, Turkmenistan has been listed as one of the few Internet enemies by Reporters without Borders due to its extensively censored Internet and strictly regulated information control policies. Existing reports of filtering in Turkmenistan rely on a small number of vantage points or test a small number of websites. Yet, the country's poor Internet adoption rates and small population can make more comprehensive measurement challenging. With a population of only six million people and an Internet penetration rate of only 38%, it is challenging to either recruit in-country volunteers or obtain vantage points to conduct remote network measurements at scale. We present the largest measurement study to date of Turkmenistan's Web censorship. To do so, we developed TMC, which tests the blocking status of millions of domains across the three foundational protocols of the Web (DNS, HTTP, and HTTPS). Importantly, TMC does not require access to vantage points in the country. We apply TMC to 15.5M domains; our results reveal that Turkmenistan censors more than 122K domains, using different blocklists for each protocol. We also reverse-engineer these censored domains, identifying 6K over-blocking rules causing incidental filtering of more than 5.4M domains. Finally, we use Geneva, an open-source censorship evasion tool, to discover five new censorship evasion strategies that can defeat Turkmenistan's censorship at both transport and application layers. We will publicly release both the data collected by TMC and the code for censorship evasion.

Submitted 17 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: To appear in Proceedings of The 2023 ACM Web Conference (WWW 2023)

arXiv:2303.07401

Drawings of Complete Multipartite Graphs Up to Triangle Flips

Authors: Oswin Aichholzer, Man-Kwun Chiu, Hung P. Hoang, Michael Hoffmann, Jan Kynčl, Yannic Maus, Birgit Vogtenhuber, Alexandra Weinberger

Abstract: For a drawing of a labeled graph, the rotation of a vertex or crossing is the cyclic order of its incident edges, represented by the labels of their other endpoints. The extended rotation system (ERS) of the drawing is the collection of the rotations of all vertices and crossings. A drawing is simple if each pair of edges has at most one common point. Gioan's Theorem states that for any two simple drawings of the complete graph $K_n$ with the same crossing edge pairs, one drawing can be transformed into the other by a sequence of triangle flips (a.k.a. Reidemeister moves of Type 3). This operation refers to the act of moving one edge of a triangular cell formed by three pairwise crossing edges over the opposite crossing of the cell, via a local transformation. We investigate to what extent Gioan-type theorems can be obtained for wider classes of graphs. A necessary (but in general not sufficient) condition for two drawings of a graph to be transformable into each other by a sequence of triangle flips is that they have the same ERS. As our main result, we show that for the large class of complete multipartite graphs, this necessary condition is in fact also sufficient. We present two different proofs of this result, one of which is shorter, while the other one yields a polynomial time algorithm for which the number of needed triangle flips for graphs on $n$ vertices is bounded by $O(n^{16})$. The latter proof uses a Carathéodory-type theorem for simple drawings of complete multipartite graphs, which we believe to be of independent interest.
Moreover, we show that our Gioan-type theorem for complete multipartite graphs is essentially tight in the sense that having the same ERS does not remain sufficient when removing or adding very few edges.

Submitted 13 March, 2023; originally announced March 2023.

Comments: Abstract shortened for arxiv. This work (without appendix) is available at the 39th International Symposium on Computational Geometry (SoCG 2023)

arXiv:2302.02031

Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning

Authors: Jacob Brown, Xi Jiang, Van Tran, Arjun Nitin Bhagoji, Nguyen Phong Hoang, Nick Feamster, Prateek Mittal, Vinod Yegneswaran

Abstract: The proliferation of global censorship has led to the development of a plethora of measurement platforms to monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are both platform-specific and have been found to be brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship. In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the potential of using large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods demonstrate the capability to uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics.
These results are underpinned by an important methodological finding: comparing the outputs of models trained using the same probes but with labels arising from independent processes allows us to more reliably detect cases of censorship in the absence of ground-truth labels of censorship.

Submitted 15 June, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: To appear in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23)

arXiv:2301.10186

ViHOS: Hate Speech Spans Detection for Vietnamese

Authors: Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Abstract: The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus containing 26k spans on 11k comments. We also provide definitions of hateful and offensive spans in Vietnamese comments as well as detailed annotation guidelines. Besides, we conduct experiments with various state-of-the-art models. Specifically, XLM-R$_{Large}$ achieved the best F1-scores in Single span detection and All spans detection, while PhoBERT$_{Large}$ obtained the highest in Multiple spans detection. Finally, our error analysis demonstrates the difficulties in detecting specific types of spans in our data for future research. Disclaimer: This paper contains real comments that could be considered profane, offensive, or abusive.

Submitted 26 January, 2023; v1 submitted 24 January, 2023; originally announced January 2023.

Comments: EACL 2023

arXiv:2212.03915

Combinatorial generation via permutation languages. V. Acyclic orientations

Authors: Jean Cardinal, Hung P. Hoang, Arturo Merino, Ondřej Mička, Torsten Mütze

Abstract: In 1993, Savage, Squire, and West described an inductive construction for generating every acyclic orientation of a chordal graph exactly once, flipping one arc at a time. We provide two generalizations of this result. Firstly, we describe Gray codes for acyclic orientations of hypergraphs that satisfy a simple ordering condition, which generalizes the notion of perfect elimination order of graphs. This unifies the Savage-Squire-West construction with a recent algorithm for generating elimination trees of chordal graphs. Secondly, we consider quotients of lattices of acyclic orientations of chordal graphs, and we provide a Gray code for them, addressing a question raised by Pilaud. This also generalizes a recent algorithm for generating lattice congruences of the weak order on the symmetric group. Our algorithms are derived from the Hartung-Hoang-Mütze-Williams combinatorial generation framework, and they yield simple algorithms for computing Hamilton paths and cycles on large classes of polytopes, including chordal nestohedra and quotientopes. In particular, we derive an efficient implementation of the Savage-Squire-West construction. Along the way, we give an overview of old and recent results about the polyhedral and order-theoretic aspects of acyclic orientations of graphs and hypergraphs.

Submitted 7 December, 2022; originally announced December 2022.

arXiv:2212.01130

Improving Pareto Front Learning via Multi-Sample Hypernetworks

Authors: Long P. Hoang, Dung D. Le, Tran Anh Tuan, Tran Ngoc Thang

Abstract: Pareto Front Learning (PFL) was recently introduced as an effective approach to obtain a mapping function from a given trade-off vector to a solution on the Pareto front, which solves the multi-objective optimization (MOO) problem. Due to the inherent trade-off between conflicting objectives, PFL offers a flexible approach in many scenarios in which the decision makers cannot specify the preference of one Pareto solution over another, and must switch between them depending on the situation. However, existing PFL methods ignore the relationship between the solutions during the optimization process, which hinders the quality of the obtained front. To overcome this issue, we propose a novel PFL framework named PHN-HVI, which employs a hypernetwork to generate multiple solutions from a set of diverse trade-off preferences and enhance the quality of the Pareto front by maximizing the Hypervolume indicator defined by these solutions. The experimental results on several MOO machine learning tasks show that the proposed framework significantly outperforms the baselines in producing the trade-off Pareto front.

Submitted 28 April, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: Accepted to AAAI-23
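
The Hypervolume indicator mentioned in the abstract above can be made concrete with a small sketch for the two-objective case. It assumes both objectives are minimized and that a reference point `ref` bounds the region of interest; the function name and the example values are hypothetical, and this is not the PHN-HVI implementation, which additionally trains a hypernetwork to maximize the indicator.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`, for two minimized objectives."""
    # Only points that are strictly better than the reference contribute.
    candidates = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in candidates:  # sweep in increasing order of the first objective
        if f2 < best_f2:  # the point is non-dominated so far: add its new strip
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, ref=(4.0, 4.0))
```

A larger value indicates a set of solutions that is closer to the Pareto front and better spread along it, which is why the indicator is a natural joint objective for several solutions generated at once.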

arXiv:2210.11374

Meeting Decision Tracker: Making Meeting Minutes with De-Contextualized Utterances

Authors: Shumpei Inoue, Hy Nguyen, Pham Viet Hoang, Tsungwei Liu, Minh-Tien Nguyen

Abstract: Meetings are a universal process to make decisions in business and project collaboration. The capability to automatically itemize the decisions in daily meetings allows for extensive tracking of past discussions. To that end, we developed Meeting Decision Tracker, a prototype system to construct decision items comprising decision utterance detector (DUD) and decision utterance rewriter (DUR). We show that DUR makes a sizable contribution to improving the user experience by dealing with utterance collapse in natural conversation. An introduction video of our system is also available at https://youtu.be/TG1pJJo0Iqo.

Submitted 20 October, 2022; originally announced October 2022.

Comments: 7 pages, AACL-IJCNLP 2022

arXiv:2206.09662

doi 10.1016/j.disc.2023.113528

On approximating the rank of graph divisors

Authors: Kristóf Bérczi, Hung P. Hoang, Lilla Tóthmérész

Abstract: Baker and Norine initiated the study of graph divisors as a graph-theoretic analogue of the Riemann-Roch theory for Riemann surfaces. One of the key concepts of graph divisor theory is the {\it rank} of a divisor on a graph. The importance of the rank is well illustrated by Baker's {\it Specialization lemma}, stating that the dimension of a linear system can only go up under specialization from curves to graphs, leading to a fruitful interaction between divisors on graphs and curves. Due to its decisive role, determining the rank is a central problem in graph divisor theory. Kiss and Tóthmérész reformulated the problem using chip-firing games, and showed that computing the rank of a divisor on a graph is NP-hard via reduction from the Minimum Feedback Arc Set problem. In this paper, we strengthen their result by establishing a connection between chip-firing games and the Minimum Target Set Selection problem. As a corollary, we show that the rank is difficult to approximate to within a factor of $O(2^{\log^{1-\varepsilon}n})$ for any $\varepsilon > 0$ unless $P=NP$. Furthermore, assuming the Planted Dense Subgraph Conjecture, the rank is difficult to approximate to within a factor of $O(n^{1/4-\varepsilon})$ for any $\varepsilon>0$.

Submitted 11 April, 2024; v1 submitted 20 June, 2022; originally announced June 2022.

Comments: 11 pages, 3 figures

Journal ref: Discrete Math. 346 (2023), no. 9, Paper No. 113528, 8 pp

arXiv:2206.00524

Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data

Authors: Khanh Q. Tran, An T. Nguyen, Phu Gia Hoang, Canh Duc Luu, Trong-Hop Do, Kiet Van Nguyen

Abstract: Society needs to develop a system to detect hate and offense to build a healthy and safe environment. However, current research in this field still faces four major shortcomings, including deficient pre-processing techniques, indifference to data imbalance issues, modest performance models, and lacking practical applications. This paper focused on developing an intelligent system capable of addressing these shortcomings. Firstly, we proposed an efficient pre-processing technique to clean comments collected from Vietnamese social media. Secondly, a novel hate speech detection (HSD) model, which is the combination of a pre-trained PhoBERT model and a Text-CNN model, was proposed for solving tasks in Vietnamese. Thirdly, EDA techniques are applied to deal with imbalanced data to improve the performance of classification models. Besides, various experiments were conducted as baselines to compare and investigate the proposed model's performance against state-of-the-art methods. The experiment results show that the proposed PhoBERT-CNN model outperforms SOTA methods and achieves F1-scores of 67.46% and 98.45% on two benchmark datasets, ViHSD and HSD-VLSP, respectively. Finally, we also built a streaming HSD application to demonstrate the practicality of our proposed system.

Submitted 1 June, 2022; originally announced June 2022.

arXiv:2202.00663

Measuring the Accessibility of Domain Name Encryption and Its Impact on Internet Filtering

Authors: Nguyen Phong Hoang, Michalis Polychronakis, Phillipa Gill

Abstract: Most online communications rely on DNS to map domain names to their hosting IP address(es). Previous work has shown that DNS-based network interference is widespread due to the unencrypted and unauthenticated nature of the original DNS protocol. In addition to DNS, accessed domain names can also be monitored by on-path observers during the TLS handshake when the SNI extension is used. These lingering issues with exposed plaintext domain names have led to the development of a new generation of protocols that keep accessed domain names hidden. DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH) hide the domain names of DNS queries, while Encrypted Server Name Indication (ESNI) encrypts the domain name in the SNI extension. We present DNEye, a measurement system built on top of a network of distributed vantage points, which we used to study the accessibility of DoT/DoH and ESNI, and to investigate whether these protocols are tampered with by network providers (e.g., for censorship). Moreover, we evaluate the efficacy of these protocols in circumventing network interference when accessing content blocked by traditional DNS manipulation. We find evidence of blocking efforts against domain name encryption technologies in several countries, including China, Russia, and Saudi Arabia. At the same time, we discover that domain name encryption can help with unblocking more than 55% and 95% of censored domains in China and other countries where DNS-based filtering is heavily employed.

Submitted 1 February, 2022; originally announced February 2022.

Comments: To appear in Proceedings of the Passive and Active Measurement Conference 2022

arXiv:2107.14550 [pdf, other]

Assistance and Interdiction Problems on Interval Graphs

Authors: Hung P. Hoang, Stefan Lendl, Lasse Wulf

Abstract: We introduce a novel framework of graph modifications specific to interval graphs. We study interdiction problems with respect to these graph modifications. Given a list of original intervals, each interval has a replacement interval such that either the replacement contains the original, or the original contains the replacement. The interdictor is allowed to replace up to $k$ original intervals with their replacements. Using this framework we also study the contrary of interdiction problems which we call assistance problems. We study these problems for the independence number, the clique number, shortest paths, and the scattering number. We obtain polynomial time algorithms for most of the studied problems. Via easy reductions, it follows that on interval graphs, the most vital nodes problem with respect to shortest path, independence number and Hamiltonicity can be solved in polynomial time.

Submitted 30 July, 2021; originally announced July 2021.

arXiv:2107.00221 [pdf, other]

Embedding-based Recommender System for Job to Candidate Matching on Scale

Authors: Jing Zhao, Jingya Wang, Madhav Sigdel, Bopeng Zhang, Phuong Hoang, Mengshu Liu, Mohammed Korayem

Abstract: The online recruitment matching system has been the core technology and service platform in CareerBuilder. One of the major challenges in an online recruitment scenario is to provide good matches between job posts and candidates using a recommender system on the scale. In this paper, we discussed the techniques for applying an embedding-based recommender system for the large scale of job to candidates matching. To learn the comprehensive and effective embedding for job posts and candidates, we have constructed a fused-embedding via different levels of representation learning from raw text, semantic entities and location information. The clusters of fused-embedding of job and candidates are then used to build and train the Faiss index that supports runtime approximate nearest neighbor search for candidate retrieval. After the first stage of candidate retrieval, a second stage reranking model that utilizes other contextual information was used to generate the final matching result. Both offline and online evaluation results indicate a significant improvement of our proposed two-staged embedding-based system in terms of click-through rate (CTR), quality and normalized discounted accumulated gain (nDCG), compared to those obtained from our baseline system. We further described the deployment of the system that supports the million-scale job and candidate matching process at CareerBuilder. The overall improvement of our job to candidate matching system has demonstrated its feasibility and scalability at a major online recruitment site.

Submitted 1 July, 2021; originally announced July 2021.

Comments: 8 pages

arXiv:2106.02167 [pdf, other]

How Great is the Great Firewall? Measuring China's DNS Censorship

Authors: Nguyen Phong Hoang, Arian Akhavan Niaki, Jakub Dalek, Jeffrey Knockel, Pellaeon Lin, Bill Marczak, Masashi Crete-Nishihata, Phillipa Gill, Michalis Polychronakis

Abstract: The DNS filtering apparatus of China's Great Firewall (GFW) has evolved considerably over the past two decades. However, most prior studies of China's DNS filtering were performed over short time periods, leading to unnoticed changes in the GFW's behavior. In this study, we introduce GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the GFW's DNS filtering behavior. We present the results of running GFWatch over a nine-month period, during which we tested an average of 411M domains per day and detected a total of 311K domains censored by GFW's DNS filter. To the best of our knowledge, this is the largest number of domains tested and censored domains discovered in the literature. We further reverse engineer regular expressions used by the GFW and find 41K innocuous domains that match these filters, resulting in overblocking of their content. We also observe bogus IPv6 and globally routable IPv4 addresses injected by the GFW, including addresses owned by US companies, such as Facebook, Dropbox, and Twitter. Using data from GFWatch, we studied the impact of GFW blocking on the global DNS system. We found 77K censored domains with DNS resource records polluted in popular public DNS resolvers, such as Google and Cloudflare. Finally, we propose strategies to detect poisoned responses that can (1) sanitize poisoned DNS records from the cache of public DNS resolvers, and (2) assist in the development of circumvention tools to bypass the GFW's DNS censorship.

Submitted 3 June, 2021; originally announced June 2021.

Comments: To appear at the 30th USENIX Security Symposium

arXiv:2105.08693 [pdf, other]

Conflict-Free Coloring: Graphs of Bounded Clique Width and Intersection Graphs

Authors: Sriram Bhyravarapu, Tim A. Hartmann, Hung P. Hoang, Subrahmanyam Kalyanasundaram, I. Vinod Reddy

Abstract: A conflict-free coloring of a graph $G$ is a (partial) coloring of its vertices such that every vertex $u$ has a neighbor whose assigned color is unique in the neighborhood of $u$. There are two variants of this coloring, one defined using the open neighborhood and one using the closed neighborhood. For both variants, we study the problem of deciding whether the conflict-free coloring of a given graph $G$ is at most a given number $k$. In this work, we investigate the relation of clique-width and minimum number of colors needed (for both variants) and show that these parameters do not bound one another. Moreover, we consider specific graph classes, particularly graphs of bounded clique-width and types of intersection graphs, such as distance hereditary graphs, interval graphs and unit square and disk graphs. We also consider Kneser graphs and split graphs. We give (often tight) upper and lower bounds and determine the complexity of the decision problem on these graph classes, which improve some of the results from the literature. Particularly, we settle the number of colors needed for an interval graph to be conflict-free colored under the open neighborhood model, which was posed as an open problem.
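
Illustrative note (not part of the arXiv record): the definition in the abstract above can be checked mechanically. The following is a minimal Python sketch, assuming the graph is given as an adjacency dictionary and the partial coloring as a dictionary that simply omits uncolored vertices. The function name `is_conflict_free` and the `closed` flag are hypothetical names chosen for this example, not taken from the paper.

```python
from collections import Counter


def is_conflict_free(adj, color, closed=False):
    """Check a (partial) coloring against the conflict-free condition.

    adj:    dict mapping each vertex to an iterable of its neighbors
    color:  dict mapping colored vertices to a color (uncolored ones are absent)
    closed: use the closed neighborhood N[u] instead of the open one N(u)
    """
    for u, nbrs in adj.items():
        hood = set(nbrs)
        if closed:
            hood.add(u)        # closed neighborhood includes u itself
        else:
            hood.discard(u)    # open neighborhood never includes u
        # Count how often each assigned color appears in the neighborhood.
        counts = Counter(color[v] for v in hood if v in color)
        # u is satisfied if some color appears exactly once.
        if not any(c == 1 for c in counts.values()):
            return False
    return True


# Path a - b - c
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(is_conflict_free(path, {"a": 1, "b": 2}))          # open neighborhood
print(is_conflict_free(path, {"b": 1}, closed=True))     # closed neighborhood
```

On the three-vertex path, coloring only the middle vertex suffices for the closed variant but not for the open one, because the middle vertex then has no colored vertex in its open neighborhood.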

Submitted 11 March, 2024; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: Accepted in Algorithmica

arXiv:2104.07376 [pdf, other]

UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches

Authors: Phu Gia Hoang, Luan Thanh Nguyen, Kiet Van Nguyen

Abstract: The increment of toxic comments on online space is causing tremendous effects on other vulnerable users. For this reason, considerable efforts are made to deal with this, and SemEval-2021 Task 5: Toxic Spans Detection is one of those. This task asks competitors to extract spans that have toxicity from the given texts, and we have done several analyses to understand its structure before doing experiments. We solve this task by two approaches, Named Entity Recognition with spaCy library and Question-Answering with RoBERTa combining with ToxicBERT, and the former gains the highest F1-score of 66.99%.

Submitted 15 April, 2021; originally announced April 2021.

Comments: Accepted at SemEval-2021 Task 5: Toxic Spans Detection, ACL-IJCNLP 2021

arXiv:2102.08332 [pdf, other]

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting

Authors: Nguyen Phong Hoang, Arian Akhavan Niaki, Phillipa Gill, Michalis Polychronakis

Abstract: Although the security benefits of domain name encryption technologies such as DNS over TLS (DoT), DNS over HTTPS (DoH), and Encrypted Client Hello (ECH) are clear, their positive impact on user privacy is weakened by the still exposed IP address information. However, content delivery networks, DNS-based load balancing, co-hosting of different websites on the same server, and IP address churn, all contribute towards making domain-IP mappings unstable, and prevent straightforward IP-based browsing tracking. In this paper, we show that this instability is not a roadblock (assuming a universal DoT/DoH and ECH deployment), by introducing an IP-based website fingerprinting technique that allows a network-level observer to identify at scale the website a user visits. Our technique exploits the complex structure of most websites, which load resources from several domains besides their primary one. Using the generated fingerprints of more than 200K websites studied, we could successfully identify 84% of them when observing solely destination IP addresses. The accuracy rate increases to 92% for popular websites, and 95% for popular and sensitive websites. We also evaluated the robustness of the generated fingerprints over time, and demonstrate that they are still effective at successfully identifying about 70% of the tested websites after two months. We conclude by discussing strategies for website owners and hosting providers towards hindering IP-based website fingerprinting and maximizing the privacy benefits offered by DoT/DoH and ECH.

Submitted 16 June, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: To appear in Proceedings of the 21st Privacy Enhancing Technologies Symposium (PETS 2021)

arXiv:2102.06427 [pdf, other]

A Subexponential Algorithm for ARRIVAL

Authors: Bernd Gärtner, Sebastian Haslebacher, Hung P. Hoang

Abstract: The ARRIVAL problem is to decide the fate of a train moving along the edges of a directed graph, according to a simple (deterministic) pseudorandom walk. The problem is in $NP \cap coNP$ but not known to be in $P$. The currently best algorithms have runtime $2^{\Theta(n)}$ where $n$ is the number of vertices. This is not much better than just performing the pseudorandom walk. We develop a subexponential algorithm with runtime $2^{O(\sqrt{n}\log n)}$. We also give a polynomial-time algorithm if the graph is almost acyclic. Both results are derived from a new general approach to solve ARRIVAL instances.
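
Illustrative note (not part of the arXiv record): the baseline mentioned in the abstract above, "just performing the pseudorandom walk", can be sketched in a few lines of Python. This is the naive exponential-time simulation, not the subexponential algorithm of the paper. The sketch assumes the usual switch-graph formulation, in which every non-destination vertex has an "even" and an "odd" successor and the train alternates between them on each visit, starting with the even one. The names `arrival`, `s_even`, and `s_odd` are hypothetical names chosen for this example.

```python
def arrival(s_even, s_odd, origin, destination):
    """Naively simulate the train; return True if it reaches `destination`.

    s_even / s_odd: dicts giving, for every non-destination vertex, the
    successor used on even-numbered (0th, 2nd, ...) and odd-numbered visits.
    Termination: the walk is deterministic, so a repeated
    (position, switch-state) pair means the train loops forever.
    """
    parity = {v: 0 for v in s_even}   # current switch position of each vertex
    pos = origin
    seen = set()
    while pos != destination:
        state = (pos, tuple(sorted(parity.items())))
        if state in seen:
            return False              # same configuration again: never arrives
        seen.add(state)
        nxt = s_even[pos] if parity[pos] == 0 else s_odd[pos]
        parity[pos] ^= 1              # flip the switch after using it
        pos = nxt
    return True


# Two inner vertices 0 and 1, destination 2.
print(arrival({0: 1, 1: 0}, {0: 1, 1: 2}, origin=0, destination=2))
```

Storing every visited configuration can take exponential space and time in the number of vertices, which is why this serves only as a reference point for the faster algorithms the abstract describes.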

Submitted 9 April, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: 13 pages, 1 figure. Added a reference

MSC Class: 05C57; 05C85; 68Q25; 68W05; 91A46 ACM Class: F.2.2; G.2.2

arXiv:2004.04623 [pdf, other]

The Web is Still Small After More Than a Decade

Authors: Nguyen Phong Hoang, Arian Akhavan Niaki, Michalis Polychronakis, Phillipa Gill

Abstract: Understanding web co-location is essential for various reasons. For instance, it can help one to assess the collateral damage that denial-of-service attacks or IP-based blocking can cause to the availability of co-located web sites. However, it has been more than a decade since the first study was conducted in 2007. The Internet infrastructure has changed drastically since then, necessitating a renewed study to comprehend the nature of web co-location. In this paper, we conduct an empirical study to revisit web co-location using datasets collected from active DNS measurements. Our results show that the web is still small and centralized to a handful of hosting providers. More specifically, we find that more than 60% of web sites are co-located with at least ten other web sites, a group comprising less popular web sites. In contrast, 17.5% of mostly popular web sites are served from their own servers. Although a high degree of web co-location could make co-hosted sites vulnerable to DoS attacks, our findings show that it is an increasing trend to co-host many web sites and serve them from well-provisioned content delivery networks (CDN) of major providers that provide advanced DoS protection benefits. Regardless of the high degree of web co-location, our analyses of popular block lists indicate that IP-based blocking does not cause severe collateral damage as previously thought.

Submitted 9 April, 2020; originally announced April 2020.

Comments: ACM SIGCOMM Computer Communication Review, Volume 50, Issue 2, April 2020

arXiv:2001.08901 [pdf, other]

doi 10.14722/madweb.2020.23009

K-resolver: Towards Decentralizing Encrypted DNS Resolution

Authors: Nguyen Phong Hoang, Ivan Lin, Seyedhamed Ghavamnia, Michalis Polychronakis

Abstract: Centralized DNS over HTTPS/TLS (DoH/DoT) resolution, which has started being deployed by major hosting providers and web browsers, has sparked controversy among Internet activists and privacy advocates due to several privacy concerns. This design decision causes the trace of all DNS resolutions to be exposed to a third-party resolver, different than the one specified by the user's access network. In this work we propose K-resolver, a DNS resolution mechanism that disperses DNS queries across multiple DoH resolvers, reducing the amount of information about a user's browsing activity exposed to each individual resolver. As a result, none of the resolvers can learn a user's entire web browsing history. We have implemented a prototype of our approach for Mozilla Firefox, and used it to evaluate the performance of web page load time compared to the default centralized DoH approach. While our K-resolver mechanism has some effect on DNS resolution time and web page load time, we show that this is mainly due to the geographical location of the selected DoH servers. When more well-provisioned anycast servers are available, our approach incurs negligible overhead while improving user privacy.
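
Illustrative note (not part of the arXiv record): one simple way to disperse queries across several resolvers is to pick the resolver from a hash of the queried name, so that a given name always goes to the same resolver and each resolver sees only a slice of the names. The Python sketch below shows that idea only. The hash-based selection rule is an assumption made for this example and is not claimed to be the mechanism the paper implements, and the resolver URLs are placeholders under the reserved `.example` domain.

```python
import hashlib

# Placeholder DoH endpoints (hypothetical; .example is a reserved TLD).
RESOLVERS = [
    "https://doh-a.example/dns-query",
    "https://doh-b.example/dns-query",
    "https://doh-c.example/dns-query",
]


def pick_resolver(domain, resolvers=RESOLVERS):
    """Deterministically map a domain name to one of the K resolvers."""
    # Normalize so "Example.com." and "example.com" land on the same resolver.
    name = domain.lower().rstrip(".")
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(resolvers)
    return resolvers[index]


for name in ["example.com", "example.org", "example.net"]:
    print(name, "->", pick_resolver(name))
```

A deterministic choice means repeated lookups of one name reveal nothing new to the other resolvers. A per-query random choice would, over time, expose most names to every resolver.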

Submitted 17 February, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

Comments: NDSS Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb) 2020

arXiv:1911.00563 [pdf, other]

doi 10.1145/3320269.3384728

Assessing the Privacy Benefits of Domain Name Encryption

Authors: Nguyen Phong Hoang, Arian Akhavan Niaki, Nikita Borisov, Phillipa Gill, Michalis Polychronakis

Abstract: As Internet users have become more savvy about the potential for their Internet communication to be observed, the use of network traffic encryption technologies (e.g., HTTPS/TLS) is on the rise. However, even when encryption is enabled, users leak information about the domains they visit via DNS queries and via the Server Name Indication (SNI) extension of TLS. Two recent proposals to ameliorate this issue are DNS over HTTPS/TLS (DoH/DoT) and Encrypted SNI (ESNI). In this paper we aim to assess the privacy benefits of these proposals by considering the relationship between hostnames and IP addresses, the latter of which are still exposed. We perform DNS queries from nine vantage points around the globe to characterize this relationship. We quantify the privacy gain offered by ESNI for different hosting and CDN providers using two different metrics, the k-anonymity degree due to co-hosting and the dynamics of IP address changes. We find that 20% of the domains studied will not gain any privacy benefit since they have a one-to-one mapping between their hostname and IP address. On the other hand, 30% will gain a significant privacy benefit with a k value greater than 100, since these domains are co-hosted with more than 100 other domains. Domains whose visitors' privacy will meaningfully improve are far less popular, while for popular domains the benefit is not significant. Analyzing the dynamics of IP addresses of long-lived domains, we find that only 7.7% of them change their hosting IP addresses on a daily basis. We conclude by discussing potential approaches for website owners and hosting/CDN providers for maximizing the privacy benefits of ESNI.
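
Illustrative note (not part of the arXiv record): the co-hosting metric in the abstract above can be approximated from a table of DNS answers. The Python sketch below assumes two conventions that are choices made for this example and may differ from the paper's exact definition. First, k counts all domains sharing an IP address, including the domain itself, so k = 1 corresponds to a one-to-one mapping. Second, a domain served from several addresses is assigned the smallest k among them. The function name `k_anonymity` and the sample data are hypothetical.

```python
from collections import defaultdict


def k_anonymity(domain_ips):
    """Return {domain: k} from a {domain: set_of_ip_addresses} mapping."""
    # Invert the mapping: which domains are hosted on each IP address?
    hosted = defaultdict(set)
    for domain, ips in domain_ips.items():
        for ip in ips:
            hosted[ip].add(domain)
    # Assumption: take the worst case (smallest crowd) over a domain's IPs.
    return {
        domain: min(len(hosted[ip]) for ip in ips)
        for domain, ips in domain_ips.items()
        if ips
    }


observed = {
    "a.example": {"192.0.2.1"},               # alone on its address
    "b.example": {"192.0.2.2"},
    "c.example": {"192.0.2.2"},
    "d.example": {"192.0.2.2", "192.0.2.3"},  # one shared and one private address
}
print(k_anonymity(observed))
```

Under these conventions an observer who sees only the destination address can narrow a visit to `b.example` down to three candidates, while a visit to `a.example` is identified outright.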

Submitted 8 July, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

Comments: In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security (ASIA CCS '20), October 5-9, 2020, Taipei, Taiwan

arXiv:1907.11086 [pdf, other]

Automated Discovery and Classification of Training Videos for Career Progression

Authors: Alan Chern, Phuong Hoang, Madhav Sigdel, Janani Balaji, Mohammed Korayem

Abstract: Job transitions and upskilling are common actions taken by many industry working professionals throughout their career. With the current rapidly changing job landscape where requirements are constantly changing and industry sectors are emerging, it is especially difficult to plan and navigate a predetermined career path. In this work, we implemented a system to automate the collection and classification of training videos to help job seekers identify and acquire the skills necessary to transition to the next step in their career. We extracted educational videos and built a machine learning classifier to predict video relevancy. This system allows us to discover relevant videos at a large scale for job title-skill pairs. Our experiments show significant improvements in the model performance by incorporating embedding vectors associated with the video attributes. Additionally, we evaluated the optimal probability threshold to extract as many videos as possible with minimal false positive rate.

Submitted 23 July, 2019; originally announced July 2019.

Comments: 5 pages, 4 figures, Proceedings of the Data Collection, Curation, and Labeling for Mining and Learning

arXiv:1907.07120 [pdf, other]

Measuring I2P Censorship at a Global Scale

Authors: Nguyen Phong Hoang, Sadie Doreen, Michalis Polychronakis

Abstract: The prevalence of Internet censorship has prompted the creation of several measurement platforms for monitoring filtering activities. An important challenge faced by these platforms revolves around the trade-off between depth of measurement and breadth of coverage. In this paper, we present an opportunistic censorship measurement infrastructure built on top of a network of distributed VPN servers run by volunteers, which we used to measure the extent to which the I2P anonymity network is blocked around the world. This infrastructure provides us with not only numerous and geographically diverse vantage points, but also the ability to conduct in-depth measurements across all levels of the network stack. Using this infrastructure, we measured at a global scale the availability of four different I2P services: the official homepage, its mirror site, reseed servers, and active relays in the network. Within a period of one month, we conducted a total of 54K measurements from 1.7K network locations in 164 countries. With different techniques for detecting domain name blocking, network packet injection, and block pages, we discovered I2P censorship in five countries: China, Iran, Oman, Qatar, and Kuwait. Finally, we conclude by discussing potential approaches to circumvent censorship on I2P.

Submitted 16 July, 2019; originally announced July 2019.

Comments: To appear in Proceedings of the 9th USENIX Workshop on Free and Open Communications on the Internet (FOCI '19). San Francisco, CA. May 2020

arXiv:1907.04245 [pdf, other]

ICLab: A Global, Longitudinal Internet Censorship Measurement Platform

Authors: Arian Akhavan Niaki, Shinyoung Cho, Zachary Weinberg, Nguyen Phong Hoang, Abbas Razaghpanah, Nicolas Christin, Phillipa Gill

Abstract: Researchers have studied Internet censorship for nearly as long as attempts to censor contents have taken place. Most studies have however been limited to a short period of time and/or a few countries; the few exceptions have traded off detail for breadth of coverage. Collecting enough data for a comprehensive, global, longitudinal perspective remains challenging. In this work, we present ICLab, an Internet measurement platform specialized for censorship research. It achieves a new balance between breadth of coverage and detail of measurements, by using commercial VPNs as vantage points distributed around the world. ICLab has been operated continuously since late 2016. It can currently detect DNS manipulation and TCP packet injection, and overt "block pages" however they are delivered. ICLab records and archives raw observations in detail, making retrospective analysis with new techniques possible. At every stage of processing, ICLab seeks to minimize false positives and manual validation. Within 53,906,532 measurements of individual web pages, collected by ICLab in 2017 and 2018, we observe blocking of 3,602 unique URLs in 60 countries. Using this data, we compare how different blocking techniques are deployed in different regions and/or against different types of content. Our longitudinal monitoring pinpoints changes in censorship in India and Turkey concurrent with political shifts, and our clustering techniques discover 48 previously unknown block pages. ICLab's broad and detailed measurements also expose other forms of network interference, such as surveillance and malware injection.

Submitted 10 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: To appear in Proceedings of the 41st IEEE Symposium on Security and Privacy (Oakland 2020). San Francisco, CA. May 2020

arXiv:1906.06069 [pdf, other]

Combinatorial generation via permutation languages. I. Fundamentals

Authors: Elizabeth Hartung, Hung Phuc Hoang, Torsten Mütze, Aaron Williams

Abstract: In this work we present a general and versatile algorithmic framework for exhaustively generating a large variety of different combinatorial objects, based on encoding them as permutations. This approach provides a unified view on many known results and allows us to prove many new ones. In particular, we obtain four classical Gray codes for permutations, bitstrings, binary trees and set partitions as special cases. We present two distinct applications for our new framework: The first main application is the generation of pattern-avoiding permutations, yielding new Gray codes for different families of permutations that are characterized by the avoidance of certain classical patterns, (bi)vincular patterns, barred patterns, boxed patterns, Bruhat-restricted patterns, mesh patterns, monotone and geometric grid classes, and many others. We also obtain new Gray codes for all the combinatorial objects that are in bijection to these permutations, in particular for five different types of geometric rectangulations, also known as floorplans, which are divisions of a square into $n$ rectangles subject to certain restrictions. The second main application of our framework are lattice congruences of the weak order on the symmetric group $S_n$. Recently, Pilaud and Santos realized all those lattice congruences as $(n-1)$-dimensional polytopes, called quotientopes, which generalize hypercubes, associahedra, permutahedra etc. Our algorithm generates the equivalence classes of each of those lattice congruences, by producing a Hamilton path on the skeleton of the corresponding quotientope, yielding a constructive proof that each of these highly symmetric graphs is Hamiltonian. We thus also obtain a provable notion of optimality for the Gray codes obtained from our framework: They translate into walks along the edges of a polytope.

Submitted 3 November, 2021; v1 submitted 14 June, 2019; originally announced June 2019.

arXiv:1809.09086 [pdf, other]

doi 10.1145/3278532.3278565

An Empirical Study of the I2P Anonymity Network and its Censorship Resistance

Authors: Nguyen Phong Hoang, Panagiotis Kintis, Manos Antonakakis, Michalis Polychronakis

Abstract: Tor and I2P are well-known anonymity networks used by many individuals to protect their online privacy and anonymity. Tor's centralized directory services facilitate the understanding of the Tor network, as well as the measurement and visualization of its structure through the Tor Metrics project. In contrast, I2P does not rely on centralized directory servers, and thus obtaining a complete view of the network is challenging. In this work, we conduct an empirical study of the I2P network, in which we measure properties including population, churn rate, router type, and the geographic distribution of I2P peers. We find that there are currently around 32K active I2P peers in the network on a daily basis. Of these peers, 14K are located behind NAT or firewalls. Using the collected network data, we examine the blocking resistance of I2P against a censor that wants to prevent access to I2P using address-based blocking techniques. Despite the decentralized characteristics of I2P, we discover that a censor can block more than 95% of peer IP addresses known by a stable I2P client by operating only 10 routers in the network. This amounts to severe network impairment: a blocking rate of more than 70% is enough to cause significant latency in web browsing activities, while blocking more than 90% of peer IP addresses can make the network unusable. Finally, we discuss the security consequences of the network being blocked, and directions for potential approaches to make I2P more resistant to blocking.

Submitted 25 September, 2018; v1 submitted 24 September, 2018; originally announced September 2018.

Comments: 14 pages, To appear in the 2018 Internet Measurement Conference (IMC'18)

arXiv:1610.02065 [pdf]

Towards an Autonomous System Monitor for Mitigating Correlation Attacks in the Tor Network

Authors: Nguyen Phong Hoang

Abstract: After carefully considering the scalability problem in Tor and exhaustively evaluating related works on AS-level adversaries, the author proposes ASmoniTor, which is an autonomous system monitor for mitigating correlation attacks in the Tor network. In contrast to prior works, which often released offline packets, including the source code of a modified Tor client and a snapshot of the Internet topology, ASmoniTor is an online system that assists end users with mitigating the threat of AS-level adversaries in a near real-time fashion. For Tor clients proposed in previous works, users need to compile the source code on their machine and continually update the snapshot of the Internet topology in order to obtain accurate AS-path inferences. On the contrary, ASmoniTor is an online platform that can be utilized easily by not only technical users, but also by users without a technical background, because they only need to access it via Tor and input two parameters to execute an AS-aware path selection algorithm. With ASmoniTor, the author makes three key technical contributions to the research against AS-level adversaries in the Tor network. First, ASmoniTor does not require the users to initiate complicated source code compilations. Second, it helps to reduce errors in AS-path inferences by letting users input a set of suspected ASes obtained directly from their own traceroute measurements. Third, the Internet topology database at the back-end of ASmoniTor is periodically updated to assure near real-time AS-path inferences between Tor exit nodes and the most likely visited websites. Finally, in addition to its convenience, ASmoniTor gives users full control over the information they want to input, thus preserving their privacy.

Submitted 6 October, 2016; originally announced October 2016.

Comments: Master's thesis

arXiv:1604.08235 [pdf]

doi 10.13140/RG.2.1.3584.8081

Your Neighbors Are My Spies: Location and other Privacy Concerns in GLBT-focused Location-based Dating Applications

Authors: Nguyen Phong Hoang, Yasuhito Asano, Masatoshi Yoshikawa

Abstract: Trilateration is one of the well-known threat models to the user's location privacy in location-based apps, especially those that contain highly sensitive information, such as dating apps. The threat model is mainly based on the publicly shown distance from a targeted victim to the adversary to pinpoint the victim's location. As a countermeasure, most location-based apps have already implemented the 'hide distance' function, or added noise to the publicly shown distance, in order to protect their users' location privacy. The effectiveness of such approaches, however, is still questionable.

Submitted 20 April, 2016; originally announced April 2016.

Comments: This work is a follow-up to arXiv:1604.07850, and is being submitted to the ICACT Transactions on Advanced Communications Technology, thus not a final version of this study

arXiv:1604.07850 [pdf]

doi 10.1109/ICACT.2016.7423532

Your Neighbors Are My Spies: Location and other Privacy Concerns in Dating Apps

Authors: Nguyen Phong Hoang, Yasuhito Asano, Masatoshi Yoshikawa

Abstract: Trilateration has recently become one of the well-known threat models to the user's location privacy in location-based applications (also known as location-based services, or LBS), especially those containing highly sensitive information, such as dating applications. The threat model mainly depends on the distance shown from the targeted victim to the adversary to pinpoint the victim's position. As a countermeasure, most location-based applications have already implemented the "hide distance" function to protect their users' location privacy. The effectiveness of such approaches, however, is still questionable. Therefore, in this paper, we first investigate how popular location-based dating applications are currently protecting their users' privacy by testing the two most popular GLBT-focused applications: Jack'd and Grindr.

Submitted 20 April, 2016; originally announced April 2016.

Comments: The 18th IEEE International Conference on Advanced Communication Technology (ICACT 2016)

Showing 1–34 of 34 results for author: Hoang, P